I wouldn't say that reproducibility is as trivial as you suggest. The whole Nix ecosystem was created because Eelco Dolstra's thesis[1] showed that reliably reproducing software alone, to say nothing of the data, requires cryptographic naming conventions and a functionally pure (side-effect-free, compositional) build system, which is extremely hard to achieve. I've talked to some highly accomplished people who say the idea behind Nix is beautiful and pure, but not workable in practice. As far as I understand it, that's why Docker became the preferred model over Nix.
For a more timely and relevant example, there's Hugging Face's model-reproduction pipeline: they keep a frozen Python script that builds and trains the model. Bugs are not allowed to be fixed, and the pipeline relies on everything that was available at time t0 always remaining available under the same name, which is not always the case. No matter how hard you try, someone will eventually fix some problem and replace the old broken thing with a new thing of the same name. That's often a good idea, but it flies in the face of any claim that reproducibility is trivial.
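A toy sketch (not Hugging Face's actual pipeline, and the file contents are made up) of why cryptographic naming helps here: a human-readable name like "train.py" survives a silent fix, while a content-derived name does not, so name drift becomes a loud failure instead of a quiet one.

```python
import hashlib

def name_of(content: bytes) -> str:
    # Content-addressed name: any change to the bytes changes the name.
    return hashlib.sha256(content).hexdigest()[:12]

original = b"def train(): ...  # version available at time t0"
patched  = b"def train(): ...  # same filename, bug quietly fixed later"

# Both versions would ship as "train.py", so a frozen pipeline fetching
# by filename can't tell the artifact changed underneath it. Fetching by
# hash would fail instead of silently running different code.
print(name_of(original) == name_of(patched))  # False
```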
And then there's data... In theory, data is just as hard, or just as easy, to replicate as software; they're both just digital artifacts. But you'll never have to worry about HIPAA, PII, or deanonymization problems with pure software, whereas those can be real problems with data.
That said, I think the talk of reproducibility being trivial distracts from the more interesting point of this article, which gets only one paragraph: that failure to replicate is interesting in itself. I was hoping for a whole article on that topic; I suspect you have a lot more interesting thoughts on it that I wanted to hear!
[1] https://www.semanticscholar.org/paper/The-purely-functional-software-deployment-model-Dolstra/7c9d53d567c4db2034d8019ff11e0eb623fe2142
See also "Problems With Existing Solutions" at
https://jonathanlorimer.dev/posts/nix-thesis.html
This is such a thought-provoking comment. First, I agree that I need to write a whole post on why lack of replicability is an unintuitive good. I’ve added it to my queue (I am afraid I’m never going to get out of Lecture 8).
Second, and relatedly, I think you understate how interesting your comment is. You are 100% right that total software reproducibility is impossible. BUT if a scientific result depends on the minute details of which libc version the compiler used, that probably indicates a problem with the scientific finding itself. Floating-point operations are non-associative; if your result rests on them being precisely associative, that points to an interesting lack of robustness worth examining in the derivation chain. But then this becomes a question of replication again. I'll have to blog about this in depth after I think about it more.
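The non-associativity is easy to see in a couple of lines. With IEEE 754 doubles, adding 1.0 to 1e16 rounds away entirely, so the grouping of the sum decides the answer:

```python
# Floating-point addition is not associative: regrouping changes the result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # (0.0) + 1.0 == 1.0
right = a + (b + c)  # 1e16 + (-1e16) == 0.0, since -1e16 + 1.0 rounds to -1e16
print(left, right)   # 1.0 0.0
```

So any result that assumes the two sums agree is already sensitive to how a compiler or library happens to order the operations.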
Thanks for commenting!
I agree in general that results should be robust, if for no other reason than that robustness makes them more replicable. Though, if I understand your thesis above, sometimes interesting results are neither robust nor replicable, and that's okay too, because many interesting phenomena are just hard to observe properly, or perhaps because we're using the wrong language to express them, leading us to ask questions with fragile premises.
Here's an example that may or may not be related: my adviser once told me that some people he knew were having a very hard time getting a computational result to come out right. IIRC, the problem was that some algorithm wasn't converging when it should have. They'd checked their proofs and were sure of them, and the result was fairly strong, so they persisted. In the end, they traced it all the way down to a bug in the Linux kernel, and once that was fixed, voila.
There are several points here. For one, sometimes a result really does depend on fiddly floating-point details, especially where numerical stability is at stake. I don't suppose we'll ever get away from fiddly numerics as long as we use fixed-precision arithmetic. For another (and I guess I'm beating a dead horse now), the category of "reproducible deterministic phenomena" is a deceptive one: while it is technically always possible to reproduce something that happens in a computer, the necessary and sufficient conditions are extreme. You have to reproduce *every*. *single*. *possible*. *thing*. or the result is in jeopardy. In short, it's not just a matter of "doing better" or "trying harder", but of adopting a different paradigm of experimental computing in the first place, of which Nix-like software reproducibility would be a necessary but not sufficient component.
Lastly, I have a friend looking over my shoulder who tells me that the issue with a mismatched libc is not that you'd get a different result, but that your code would simply refuse to run. He also recommends this post from the Guix project (which aims to make the Nix idea easier to use).
https://hpc.guix.info/blog/2024/03/adventures-on-the-quest-for-long-term-reproducible-deployment/
The link to Nosek et al is broken. This one (https://www.annualreviews.org/content/journals/10.1146/annurev-psych-020821-114157) takes us to the correct page.
Thanks. Fixed.
> Call me crazy, but some parts of the world can’t be mathematicized or sciencified. I’ll expand on this argument in a future post.
Looking forward to this! Especially curious about your thoughts wrt decision making.