Just expressing my confirmation bias on two points:
* My immediate reaction to LLMs was: this makes mediocrity easy. That is, for tasks where I am expert, such as writing an article, the LLM did nothing for me. For tasks where my expertise was minimal, like writing Python code, it was great. So, I believe the result.
* If you only have 16 subjects, you'd better have results so clear that no statistical analysis is needed to convince the readers. For example: all 8 jumpers who used a parachute survived; none of the 8 without one did. Once error bars are relevant, I'm in the sceptical camp.
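For what it's worth, the parachute example really is decidable by pure counting: with the margins fixed, the one-sided exact p-value is one line of arithmetic. A minimal sketch in Python (the 8-vs-8 split is just the hypothetical from the example above):

```python
from math import comb

# 16 jumpers: 8 randomly given a parachute, 8 not.
# Observed: all 8 parachute jumpers survive, none of the others do.
# Under the null that assignment is irrelevant, every placement of the
# 8 survivors among the 16 jumpers is equally likely, so the one-sided
# exact p-value is the chance that all survivors fall in the treated group.
n, treated, survivors = 16, 8, 8
p_value = comb(treated, survivors) / comb(n, survivors)
print(p_value)  # 1/12870, about 7.8e-05
```

No standard errors required: the result is so lopsided that the exact null probability is already tiny.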
Not to horrible standard errors guy the horrible standard errors guy, but...
> The ideal experiment satisfies SUTVA, the Stable Unit Treatment Value Assumption, which asserts that the only thing that affects a measured outcome is the assignment to the treatment group or the control group.
I think it's more accurate to say that SUTVA asserts that the potential outcomes for unit i are not affected by unit j's assignment to treatment or control. It sounds like the problems with this study are violations of excludability more than SUTVA (or maybe both are violated).
For example, if the developers' completion times increase over time as they get more tired, this is a violation of excludability. By allowing the developers to choose the order of the tasks, factors other than the treatment itself can cause differences in completion times.
On the other hand, if the condition assignment of Task A affects the completion time of Task B (or vice versa), this would be a violation of SUTVA. It's easy to imagine this happening here as well. For example, suppose there is a lot of overlap in the skills and knowledge required by the two tasks. If the first is done with an LLM helper, this could give the developer useful reminders and context for the second task, which might then be completed more quickly than it would have been otherwise. Or the reverse: if the developer does the first task without the LLM, maybe it's distracting to then get a bunch of unnecessary code-completion suggestions.
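A toy simulation makes the carryover worry concrete. All the effect sizes below are made up for illustration; the only point is the direction of the bias:

```python
import random

random.seed(0)

TRUE_EFFECT = -10.0  # assumed: the LLM genuinely saves 10 minutes
SPILLOVER = -5.0     # assumed: doing the LLM task first saves 5 minutes on the other task

ai_times, control_times = [], []
for _ in range(10_000):
    base = random.gauss(60, 5)        # developer's baseline minutes per task
    ai_first = random.random() < 0.5  # coin flip for which condition comes first
    t_ai = base + TRUE_EFFECT
    t_control = base + (SPILLOVER if ai_first else 0.0)  # carryover from the LLM task
    ai_times.append(t_ai)
    control_times.append(t_control)

naive = sum(ai_times) / len(ai_times) - sum(control_times) / len(control_times)
print(round(naive, 1))  # in expectation TRUE_EFFECT - 0.5 * SPILLOVER = -7.5, not -10
```

Because half the control tasks inherit the spillover, the naive within-developer contrast is pulled toward zero even though every single assignment was randomized.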
I think this is just a matter of what one wants from their nitpicking.
My abstract version of SUTVA is this:
- there is an assignment vector Z in {0,1}^N, where N is the number of units.
- there are completion times Y_i(Z) for each unit i in {1, ..., N} (the times are a function of the full randomized assignment).
SUTVA says that for all Z with Z_i = 1, Y_i(Z) are the same, and for all Z with Z_i = 0, Y_i(Z) are the same.
Since the authors don't explicitly account for the order of task completion, I think it's fair to assess the order flexibility as a SUTVA violation.
Now, if they had explicitly considered the ordering variable, then we could make the exclusion assumption:
Y_i(Z, order) = Y_i(Z, order')
for all orders.
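A quick sketch of why that exclusion restriction can fail when developers pick their own ordering. The effect sizes are made up; the mechanism is fatigue making the potential outcome depend on the order argument:

```python
import random

random.seed(1)

TRUE_EFFECT = -10.0  # assumed LLM effect, minutes
FATIGUE = 8.0        # assumed: the task done second takes 8 extra minutes

def Y(z, order, base):
    """Potential outcome Y_i(Z, order). Because it depends on `order`,
    Y_i(Z, order) != Y_i(Z, order') and the exclusion restriction fails."""
    return base + (TRUE_EFFECT if z == 1 else 0.0) + (FATIGUE if order == 2 else 0.0)

diffs = []
for _ in range(20_000):
    base = random.gauss(60, 5)
    ai_order = 2 if random.random() < 0.8 else 1  # self-selection: LLM task usually saved for last
    diffs.append(Y(1, ai_order, base) - Y(0, 3 - ai_order, base))

naive = sum(diffs) / len(diffs)
print(round(naive, 1))  # in expectation -10 + 8 * (0.8 - 0.2) = -5.2, masking part of the effect
```

If the ordering variable were recorded, comparing only within the same order position would recover the full -10.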
I guess this is just to say that when you're the hater in the backrow, there are plenty of stones you can lob at the seminar speaker.
Hmmm... the limitations of Substack comment formatting make it difficult to use the notation I want to use to express my understanding of SUTVA, which is that Y_i(Z, D) = Y_i(Z_i, D_i), where the Z and D without subscripts are vectors representing all of the assignments and treatments across units. It's typed more clearly with bold typeface in slide 4 here:
https://community.lawschool.cornell.edu/wp-content/uploads/2020/12/Green-presentation-on-SUTVA-for-CELS.pdf
Is the definition you gave an assumption of homogeneity of the treatment effect? By "Y_i(Z) are the same", do you mean that the potential outcomes don't vary across units (within treatment or control)?
Right, that's the same definition I'm using. It's taken from Angrist, Imbens, and Rubin, where they study LATE. You can define SUTVA the same way without the D, and that's the version I'm using.
It isn't clear to me that the data indicate that developers fail to correctly allow for the fold increase in actual developer time over estimated time. Perhaps they should have used Montgomery Scott's dictum: "Always report double the estimated time to repair the Enterprise to your captain. That maintains your reputation as a miracle worker!"
Then there is the classic: The last N% of coding on a project takes (100-N)% of the time.
Thanks for debunking the hype!