Colin Champion, 1 Mar 2022
intro :
Condorcet discussed an example in which there were 3 candidates and 81 voters. He noted that the Borda count elected a candidate – B – in violation of the Condorcet criterion, which prefers A. He concluded that the Borda count was based on false reasoning.
Donald Saari resurrected Condorcet’s example, showing that A does not have the property of cancelling properly, which is possessed by B alone, and concluded that Condorcet methods were based on false reasoning.^{[1]}
These are two different opinions as to who is the correct winner of a well-defined election. The interesting question, to my mind, is not whether A or B is the rightful winner, but what makes him so: this is the semantic question. Without an answer to it, the topic is built on sand.
Saari’s procedure turns philosophical logic on its head: he determines truth by applying logical relations instead of validating the relations against a concept of truth. His method follows Arrow’s, and Arrow’s follows American constitutional history, in which the requirement for proportionality in the House of Representatives has been derailed by arguments derived from monotonicity criteria.
I can see 3 possible approaches to voting semantics. The first, which I will call formalistic, descends from Arrow. It holds that the ballots are the ultimate reality, making no reference to anything beyond themselves, so that voting truth lies in some logico-arithmetic property of the counts. To the best of my knowledge no actual semantics have been proposed in this family, which is more an implicit rejection of alternatives than a positive proposal.
The most basic non-formalist semantics might be termed Tarskian, or free-variable, semantics. This makes reference to external facts in the simplest way. Consider a jury model (otherwise known as a ‘valence model’) in which 9 voters vote A>B and 1 votes B>A. It would be possible to say that a majority vote was right so long as A was in fact the better candidate. This was possibly in Condorcet’s mind; it’s a view I used to hold some sympathy for, but it doesn’t seem to capture the truth of the matter. The correctness of an electoral decision should lie in whether it makes the best use of the evidence available to it, not in whether it ‘lucks out’ on hidden facts.
My preference is for ‘bound-variable’ semantics; for instance we might say that a voting method elects the right candidate under a jury model if it elects the candidate who is likeliest to be best, or under a spatial model if it elects the candidate whose sum of distances from voters is most likely to be least. (These definitions are illustrative, and can be modified.) I suspect that bound-variable semantics are closer to Condorcet’s view. External facts are now relevant, but they are quantified or integrated over, thereby avoiding making the truth of electoral decisions depend on contingent facts unavailable to them.
Bound-variable semantics need to be based on a model of the electoral process, and since more than one model is available, they must offer alternative meanings. It may easily happen that an election has different rightful winners under different semantics. The Borda count is likely to coincide with the truth under a jury model while Condorcet methods are likelier to do so under a spatial model.
Further bound-variable semantics are implicit in evaluations conducted by Robert F. Bordley in 1983 and by Warren D. Smith in 2000. The former is closely related to a jury model while the latter is sui generis, but neither is of great interest and I do not discuss them further.
The natural technique for binding the variables is Bayesian marginalisation, and if we make use of it we may talk of bound-variable semantics as ‘Bayesian’.
Voting semantics can be put to work in connection with the logical voting criteria. The golden precept of philosophical logic is that inference rules must be validated against the semantics; and given the propensity of the criteria to run into contradiction, this need can nowhere be more imperative.
Perhaps the simplest example is the unanimity criterion, which states that if every voter ranks A first, then A wins. This criterion cannot be justified under a free-variable semantics for a jury model, since there’s nothing logically impossible in supposing that an electorate unanimously makes the same mistake. This doesn’t do the criterion justice, and seems more an argument against the semantics than against the criterion.
Adopting a bound-variable semantics and a spatial model, it is obvious that if A is closer to every voter than any other candidate, then the average distance of voters from A is less than the average distance from any other candidate; hence the unanimity criterion is valid. Under a jury model a similar conclusion is obtained by elementary probability theory, so – unless some other semantic model needs to be taken into account – we may say that the criterion is valid outright.
Unanimity doesn’t get us very far. It is more useful to consider the participation and monotonicity criteria, and initially I will concentrate on a jury model. Monotonicity is little more than a double application of the participation criterion in this case.
Consider the participation criterion for a jury model with 2 candidates, A and B. It will be valid under the proposed semantics if, whenever A is the rightful winner for a set of n–1 ballots, he will remain the rightful winner if an n^{th} ballot is cast as A>B.
Write the candidate valences as v_{A}=x and v_{B}=y. Assume that voter j ranks the candidates i in decreasing order of v_{i} – ε_{i j} where ε_{i j} is a Gaussianly distributed noise term. Write δ = x–y and suppose that we have a posterior distribution on δ obtained from considering n–1 ballots. Then the n^{th} ballot will be cast as A>B if ε_{An} – ε_{Bn} < δ. The probability of this happening – ie. the probability of the observed ballot – is an increasing function of δ; and therefore the Bayes factor in favour of δ is likewise increasing. Hence the additional ballot A>B makes it likelier that A is better than B, which is enough to confirm the participation criterion.
Now introduce a third candidate C. The triple (x,y,z) of valences occupies a position in 3-D space but the ballots are indifferent between positions which differ along the direction (1,1,1), so we need only consider the projection of valences onto some plane orthogonal to this line. Looking towards such a plane from a distant point along the (1,1,1) line, we will see the x, y and z axes as pointing in different directions. The part of the plane in which A has greater valence than either rival lies to the bottom left of the blue dashed lines.^{[2] }
We are interested in the strength of evidence for a valence triple (as a log Bayes factor) provided by the ballot A>B>C. The Bayes factor f(x,y,z) is given by

    f(x,y,z) = p(A>B>C | x,y,z) / p(A>B>C | ℋ)

where ℋ is a suitable alternative hypothesis – for simplicity I choose it to be the random hypothesis under which all 6 rankings are equally likely.
We may now draw “isofactors” – contour lines in which the strength of evidence implied by the ballot is equal, shown in brown. If we have a probability distribution on (x,y,z), then a ballot will nudge it in the direction of increasing isofactors, usually reducing its dispersion.
So suppose we start off with a prior distribution on the valences centred on O, and that we refine it to a posterior in the light of n–1 ballots, and suppose that the distribution we thus obtain straddles the broken blue lines. Then the isofactors for a ballot A>B>C are roughly parallel (loosely speaking) to these lines, implying that the part of the distribution inside the region is boosted relative to the part outside it; i.e. the posterior likelihood of A being best increases in the light of the ballot, again confirming the participation criterion.
But it’s an arm-waving argument rather than a demonstration, and I suspect that it would be hard to be more rigorous. A failure of the participation criterion may be rare – it may even never arise naturally – but we can induce one by adopting an unnatural prior which treats the candidates asymmetrically (and which we would therefore be justified in rejecting). Suppose that the prior distribution on (x,y,z) is concentrated at two points P and Q (with A best at P and C best at Q), with P ever so slightly likelier than Q. Then A is the rightful winner before any ballots are cast. If a single ballot is cast A>B>C, then this ballot is likelier at Q than at P, so it changes the relative weights of the two points in such a way that C is now likelier than A to be best.
Well, I wouldn’t wish to place any weight on an argument starting from such a prior. The difficult question is whether a posterior distribution starting from an impartial prior can exhibit similar properties, as would happen if the posterior was a cigar-shaped distribution from P to Q. No one computes these posteriors, so no one knows what shapes they may take. We should say that the participation criterion is ‘almost valid’, or (perhaps better) that it can be adopted as a statistical approximation. I suspect it remains so for larger numbers of candidates.
Now let’s move on to the monotonicity criterion, which can be illustrated as before. We define the Bayes factor g(x,y,z) as meaning that the ballot A>B>C is g times likelier than the ballot B>A>C for the valences (x,y,z). Assuming that the candidates are modelled anonymously it follows that
    g(x,y,z) = f(x,y,z) / f(y,x,z)
which accounts for the left-right symmetry in the diagram and makes clear the kinship between monotonicity and the participation criterion.
So suppose that we have an election E comprising n ballots, the last of which is cast as B>A>C, and suppose that A is likeliest to be best under the posterior distribution generated by E. Consider an alternative election E' which differs from E only in the final ballot being A>B>C instead. Monotonicity requires that A necessarily win E'.
The isofactors are roughly vertical in the diagram as drawn, and in particular the isofactor corresponding to a log Bayes factor of 0 coincides with the line separating x>y from x<y. It follows that the probability of A being best can only increase relative to the probability of B being best as we move from E to E'. This may be the intuition behind the monotonicity criterion, but it doesn’t get us very far because the criterion isn’t stated in terms of posterior probabilities, but in terms of electoral decisions.
The danger is that as we move from E to E', C may take over from A as the likeliest best candidate. This will certainly happen if the posterior distribution arising from E is concentrated at the two points P and Q with very slightly more weight at Q. The Bayes factor obtained by moving from E to E' tilts the weight from Q to P, making C the likeliest best candidate. The position is similar to what we saw previously, so we may likewise regard the monotonicity criterion as having the status of a statistical approximation.
Spatial models are harder to analyse, and the parallel between monotonicity and the participation criterion breaks down. Start with the participation criterion.
Suppose we have a single election E whose rightful winner is A: then the average distance from A to the voters is less than the average for any other candidate. Now suppose we produce a new election E' by adding a voter whose ballot is A>B>C. Clearly A is the rightful winner of E' by a larger margin than of E – which perhaps establishes the criterion under a sort of free-variable spatial semantics.
But to validate it properly we need to average probabilistically over all possible distributions of voters and candidates in space. Partition the distributions into two sets: S, in which A is the rightful winner under E, and S̄, in which he is not.
Then, as before, adding a voter whose ballot is A>B>C leaves A the rightful winner for every distribution in S and may make him the rightful winner for a few in S̄. The pitfall is that preferences lower down the ballot may boost the likelihood of S̄ relative to S, thereby toppling A from his position. It is difficult to see any reason why this cannot happen; so it seems that the participation criterion cannot be valid under a spatial model.
The case is different for the monotonicity criterion. When we replace a voter whose ballot is B>A>... by one whose ballot is A>B>... we leave the lower preferences unaltered, making it hard to tilt the likelihood between S and S̄. However we can construct a counterexample to the criterion by having recourse to another contrived prior.
    D>A>B>C   10
    A>D>B>C    1
    A>B>D>C    1
    B>A>C>D    2
    B>C>A>D    3
    C>B>A>D    3
But suppose that one of the B>A>C>Ds is replaced by A>B>C>D, which is incompatible with the linear order because a voter cannot prefer A to B without also preferring D to C. We now have to assume the tetrahedral arrangement, and D is likely to be the rightful winner (and will certainly be so if we assume that the voters cluster at the vertices).
The prior is certainly unnatural – but has anyone ever said that the monotonicity criterion rests on an assumption of smoothness? The criterion seems again to be a reasonable statistical approximation but not a logical truth.
The results we have obtained are consistent with the known properties of the criteria. Under a jury model, both approximations are satisfied by the Borda count, which is known to be almost optimal. Under a spatial model, monotonicity alone of the two has validity as a statistical approximation, and it alone is satisfied by the Condorcet methods which give the best performance. We would not go far wrong if we allowed ourselves to be guided by these approximations in choosing a method.
[I made a terminological comment here, but Kevin Venzke pointed out that it was mistaken.]
If we are willing to accept the monotonicity criterion as valid, then it can be used to judge the accuracy of voting methods. It doesn’t just furnish a theoretical objection: the violation of a valid criterion must point to an incorrect result.
Consider the following 17-voter election E:

    A>B>C   4
    A>C>B   2
    B>A>C   1
    B>C>A   4
    C>A>B   3
    C>B>A   3
Now construct an election E' in which the two ballots A>C>B are replaced by C>A>B. The new Borda counts are (14,17,20) and I estimate the probabilities of being best as 12:28:60, so C is more emphatically the rightful winner than before. However this is a monotonicity failure for IRV in which C is deprived of victory as a consequence of getting extra votes, with B being elected instead. And as well as being a monotonicity failure, it’s an incorrect result: B is simply the wrong winner of E', not just by comparison with another voting method, but by comparison with the actual facts.
The posterior probabilities I’ve quoted were obtained from a very slowly converging Metropolis-Hastings integration (requiring about 10^{10} samples) and may not be exact. See the software below. It implements a Bayesian decision-theoretic voting method for a jury model analogous to my previous effort for a spatial model. In my previous work I needed strong prior information in order to make headway but in the present case the model is so much simpler that I can avoid excessively restrictive assumptions.
A viewpoint I can now present is that a “Bayesian decision-theoretic voting method” may be more than a voting method: if conducted with full rigour it will be a direct determination of the truth. If inexact, then it is on a par with other methods (though less ad hoc).
The criteria used in Arrow’s Impossibility Theorem refer to slightly different electoral decisions than those above, since they concern the selection of a ranking rather than a single candidate. To avoid unnecessary confusion I will discuss his criteria as they apply to the simpler case.
He gave one of his criteria the name “Pareto Principle”. Applied to the selection of a single candidate, this principle reduces to the unanimity criterion as stated above, and is valid under a Bayesian semantics. However Arrow sees so little need to argue for it – remarking that “it’s hard to imagine anyone quarrelling with it”^{[3]} – that he leaves us in the dark about its meaning. It would in fact be very easy to quarrel with his “Pareto Principle” under Tarskian semantics since it’s perfectly possible for the electorate to be unanimously wrong. Meanwhile the absence of any statistical reasoning makes it impossible to interpret Arrow’s principle in an evidential sense; so we’re left assuming that his implicit semantics were formalistic.
This is confirmed by his terminology. The true Pareto Principle states that if A is better than B for each individual, then it is better for society as a whole. The unanimity criterion states that if each individual expresses the opinion that A is better than B, then a democratic decision cannot overturn their view. The facts of the matter have nothing to do with it. By attaching Pareto’s name to the criterion, Arrow committed himself to treating opinions as the ultimate facts.
Arrow’s other main criterion was IIA, which states that if A is the rightful winner of an election E between A, B, and C, then A is also the rightful winner of an election E' between A and B in which the ballots differ simply by the removal of C.
This is clearly invalid under a Bayesian semantics. Suppose that a jury model applies and that the ballots in E split 49% A>C>B and 51% B>A>C. Then A will be the rightful winner because A>C>B is stronger evidence of A being better than B than B>A>C is of the opposite. But when C is removed from the ballots the difference in strength disappears, and B becomes the rightful winner. It is impossible to see any irrelevance in the information elicited by C’s candidacy.
Indeed if the correctness of an electoral decision is understood evidentially, then IIA is patently absurd: it states that if the evidence supports A, then different evidence must also support A because A is in fact the rightful winner.^{[4]}
The mystery is how IIA can ever have been taken for a logical principle. It seems that Arrow viewed it along the lines of Tarskian semantics. Thus the hypothetical option of teleportation is “irrelevant” to civic transport decisions because it wouldn’t affect the relative merits of “rapid transit, automobile and bus”, these being conceived as real-world properties. So if A is the rightful winner, then it is a fact about reality that he is preferable to the other candidates, and no additional evidence can shake the fact.
Arrow advanced no semantics of his own. His “Pareto Principle” can only be understood formalistically and is untenable under Tarskian semantics, whereas IIA can only possibly have a Tarskian justification; so perhaps it is no surprise that Arrow’s argument is a proof by contradiction.
The rationale behind the logical criteria has always been unclear. A little while ago Wikipedia told its readers that “there is no ultimate authority on which criteria should be considered more important than others”.^{[5]} Voting theory is left as little more than a mystical religion.
The criteria have been presented as an independent source of knowledge on the merits of voting methods – independent, that is, of empirical evaluation. But if they need to rest on a semantics, and the semantics must be Bayesian, then the criteria can do no more than crystallise evidential properties which can be measured more systematically through evaluation.
I’m sorry not to have got further in analysing the soundness of the participation and monotonicity criteria. I hope at least to have shown that it is possible to discuss their validity in a mathematical framework. And as for their correctness... well, the onus of proving the soundness of a logical principle does rather fall on the people advocating it.
My final remark is something which I have always regarded as obvious, but which seems not to be universally accepted, namely that it makes no sense to claim that a voting method is better or worse than another except under a particular electoral model. It has been known for more than 50 years that Condorcet methods are optimal under certain spatial models, and known since Peyton Young’s work that the Borda count is almost optimal under jury models, yet people still try to discuss the merits of different voting methods in a semantic vacuum.
[1]. My knowledge of Saari’s argument is derived solely from Eric Pacuit’s Stanford Encyclopedia article (“Voting Methods”). I assume sincere voting throughout.
[2]. The diagram assumes unit variance for the epsilons. Logarithms are to base 10.
[3]. Section 4 of “Values and Collective Decision-Making”.
[4]. Good and Tideman make a similar argument in their discussion: “if the set of candidates changes, thereby changing the available evidence, then there is no reason why the collective ordering of a pair should not sometimes be reversed”. They attribute similar thoughts to Buchanan and Plott.
I assume that candidate valences come from a zero-mean Gaussian with unit variance – this defines the scale factor. I posit a hyperparameter α drawn from a zero-mean Gaussian with standard deviation 5 – this provides a large degree of diffuseness in the model. And then I assume that the epsilons come from a zero-mean Gaussian with variance e^{α}. So the noise may be much louder or much quieter than the signal. I find that α is centred on 2.7 in E and on 1.5 in E', implying that the noise is always louder than the signal and is louder (as we’d expect) in E than in E'.
In principle we might allow non-Gaussian distributions, or functional relationships between candidate valences and the variances of the associated error terms, but these have the air of gratuitous over-complication. Setting them aside, the only freedom left in the model is in the SNR, and there is no need to make any significant assumption here since it is as easy to average over it as it is to assume a predetermined value.
#include <math.h>
#include "memory.h"
double gaussv(),ranf() ;
static int nbal[6] = { 4 , 2 , 1 , 4 , 3 , 3 } , idex[64] ;
static int a[6][3] = { {0,1,2},{0,2,1},{1,0,2},{1,2,0},{2,0,1},{2,1,0} } ;

/* derive the 17 ballots implied by the valences/epsilons in x;
   return -1 if they fail to match the tallies in target        */
static int getballots(double *x,int *count,int *target)
{ int i,j ;
  xi eval[3] ;
  for(i=0;i<6;i++) count[i] = 0 ;
  for(i=0;i<17;i++)
  { for(j=0;j<3;j++) eval[j] = xi(x[j]-x[3+j+3*i],j) ;
    if(eval[1].x>eval[0].x) swap(eval[0],eval[1]) ;
    if(eval[2].x>eval[1].x)
    { swap(eval[1],eval[2]) ; if(eval[1].x>eval[0].x) swap(eval[0],eval[1]) ; }
    j = idex[(eval[0].i<<4)|(eval[1].i<<2)|eval[2].i] ;
    if(count[j]==target[j]) return -1 ;
    count[j] += 1 ;
  }
  return 0 ;
}

/* log joint density of the valences x[0..2], the epsilons x[3..53]
   and the hyperparameter alpha = x[54]                             */
static double logpdf(double *x)
{ double alpha=x[54],var=exp(alpha),logp=-(alpha*alpha)/25-51*log(var) ;
  int i ;
  for(i=0;i<3;i++) logp -= x[i]*x[i] ;
  for(i=3;i<54;i++) logp -= x[i]*x[i]/var ;
  return logp/2 ;
}

int main(int argc,char **argv)
{ int i,j,**bal=imatrix(6,3),*count=ivector(6),i0,i1,i2,r0,r1 ;
  double q,qdash,eps,x[55],y[55],qalpha ;
  long int niter,iter,n[3],win[3] ;
  if(argc>1) niter = (long) pow(10,atof(argv[1])) ; else niter = 1000000 ;
  for(i=0;i<64;i++) idex[i] = -1 ;
  for(i=0;i<6;i++) idex[(a[i][0]<<4)|(a[i][1]<<2)|a[i][2]] = i ;
  // nbal[4] += 2 ; nbal[1] -= 2 ; // uncomment this line to use election E'

  // find a legal starting position by random trials
  for(x[54]=iter=0;iter<1000000;iter++)
  { for(i=0;i<54;i++) x[i] = gaussv() ;
    if(getballots(x,count,nbal)==0) break ;
  }
  for(i=0;i<6;i++) printf("%3d",count[i]) ;
  printf("\n") ;
  q = logpdf(x) ;
  for(i=0;i<3;i++) win[i] = n[i] = 0 ;

  // metropolis-hastings
  for(eps=0.002,qalpha=iter=0;iter<niter;iter++)
  { for(i=0;i<3;i++) if(i==0||x[i]>x[j]) j = i ;
    if(iter>=10000) { win[j] += 1 ; qalpha += x[54] ; }
    for(i=0;i<54;i++) y[i] = x[i] + eps*gaussv() ;
    y[54] = x[54] + 100*eps*gaussv() ;
    if(getballots(y,count,nbal)<0) { n[2] += 1 ; continue ; }
    qdash = logpdf(y) ;
    if(exp(qdash-q)<ranf()) n[1] += 1 ;
    else { q = qdash ; for(i=0;i<55;i++) x[i] = y[i] ; n[0] += 1 ; }
  }

  for(q=i=0;i<3;i++) q += n[i] ;
  q = 100 / q ;
  printf("okay: %.2f%%; unlikely: %.2f%%; impossible:%.2f%%\nprobs: ",
         n[0]*q,n[1]*q,n[2]*q) ;
  for(q=i=0;i<3;i++) q += win[i] ;
  printf("%.2f%% : %.2f%% : %.2f%% [%.1f]\n",
         100.0*win[0]/q,100.0*win[1]/q,100.0*win[2]/q,qalpha/q) ;
}