Hillary Clinton vs. Bernie Sanders: Debunking Some Election Fraud Allegations
In 2010, the two of you used a Benford's Law Test, if I am not mistaken, to analyze results in a South
Carolina primary contest. The results were reported on here at FiveThirtyEight:
http://fivethirtyeight.com/features/sc-democratic-primary-getting-weirder/
I am wondering if you have or would be willing to run a similar statistical analysis for the New York
Democratic Presidential Primary results, especially as released on election night. Some of the results
(like basically perfect 60-40 and 70-30 splits in Kings County and the Bronx ... there are several others,
including a virtually perfect 58-42 split for the overall results) seem ripe for a Benford's Test, but I am
not expert in stats and elections.
I've attached a document with the county level results from election night as published by Politico.
There have been some slight adjustments with the additions of absentee and affidavit ballots since these
results.
What I can promise is that I will publish on the results of your findings whichever way the test comes
out.
Message body
Thank you, Profs. Miller and Mebane, for your quick replies.
Could we start, then, with the New York City results? Those are available at the precinct level in pdf
Message body
I'll echo Walter here. If we are supplied with a clean, uniformly coded file with all necessary
information included (as we were during the Rawls-Greene affair) we can possibly tell you something.
But I don't think either of us can take the time to clean data.
Message body
Ok. I'll have that done and get back to you. I have voter registration data. Let me see if it is by Election
District (precinct).
For clarity, what is the clean formatting you would like?
Doug
CC
Walter Mebane
Michael Miller
smm424@cornell.edu
Message body
We have a database of all NYS voters that we obtained by FOIL. Checking to see if it includes a
breakdown by election district. I doubt it. Might have to go back to NYC BOE for that.
I am looping Stewart McCauley in here as he is going to clean up the dataset (or co-ordinate doing so).
Knowing the precise specifications for the data set that would make your job easiest would be best.
Doug Johnson Hatlem
Stewart M. McCauley
Message body
Profs. Mebane and Miller,
Attached you will find a .csv file with the information, we believe, in workable format (organized first
by Assembly District + Election District with which Congressional District each falls into in the far left
column). This includes all registered voters by ED (including GOP) with further columns with
breakdowns for DemsActive, Inactive and Purged (only purged may not vote). PreReg voters (column I
and M) are voters who registered as 17 yo about to turn 18 before November.
Message body
Doug,
I'll focus my response on your pdf document. I have looked at these data using a VERY quick Benford
test (i.e., I would want some more time to carefully go through it before I said anything about it), but
that's really Walter's department so I won't have much to say about it until he runs it through his 2BL
test (which, to be clear, I did not do). If things went as your theory suggests, yes, I'd say that would
trigger Walter's test most of the time, but I'll let him cover that if he wants.
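For readers unfamiliar with the kind of test discussed above, here is a minimal sketch of a second-digit Benford check on precinct vote counts. It is an illustration only, not Walter's actual 2BL procedure: the function names are hypothetical, the input is assumed to be a list of whole-number vote counts, and the comparison against a chi-square critical value (9 degrees of freedom, 5% level) is one common convention rather than anything specified in this thread.

```python
import math
from collections import Counter

# Chi-square critical value for 9 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF9_05 = 16.919


def second_digit_probs():
    """Benford's Law probabilities for the second digit (0 through 9)."""
    return [
        sum(math.log10(1 + 1 / (10 * first + d)) for first in range(1, 10))
        for d in range(10)
    ]


def second_digit_chi2(vote_counts):
    """Pearson chi-square of observed second digits against Benford's Law.

    Assumes whole-number counts. Counts below 10 have no second digit
    and are skipped. Returns (statistic, number_of_counts_used).
    """
    digits = [int(str(c)[1]) for c in vote_counts if c >= 10]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one count of 10 or more")
    observed = Counter(digits)
    stat = 0.0
    for d, p in enumerate(second_digit_probs()):
        expected = n * p
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat, n


if __name__ == "__main__":
    # Made-up counts purely to show the call; not data from this thread.
    stat, n = second_digit_chi2([112, 87, 143, 59, 201, 76, 5])
    print(f"chi-square = {stat:.2f} on {n} counts; "
          f"exceeds 5% critical value: {stat > CHI2_CRITICAL_DF9_05}")
```

A statistic above the critical value would only flag the digit distribution as unusual. On its own it says nothing about the cause, and with few precincts the chi-square approximation is unreliable.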
Thoughts on your report:
First question: Why are we not using certified data? Those are the votes that count. Lots of the things
you are mentioning in the report (see below) could be corrected in the certified data, if they are what I
think they are.
Table 2: It's not clear to me why larger precincts being better for Clinton should be an indication of
fraud/error. I calculate the overall correlation between total votes and Clinton percentage as .205.
That's a moderate correlation, but there are lots of plausible explanations. Knowing what we know
about Clinton's base of support, it could simply be that large precincts are located in areas with higher
minority populations (which would be consistent with practice in many states, and which you reference
on page 4). I couldn't say anything about it without knowing where these precincts are, and what the
racial makeup of them is. There is some allusion to controlling for race in your email, but there is no
race data in this file so I'm not sure how that was done.
More important, the relationship does not appear to be uniformly linear. So I drilled down a bit. I
attached a scatter plot of total votes against Clinton's vote percentage. You can see pretty clearly that
the relationship is actually positive only through total votes of about 150, and then it levels off and
disappears altogether. Indeed, the correlation coefficient is actually small and negative (and not
statistically significant) in precincts with more than 140 votes cast, indicating that in those precincts
there is no relationship between size and success of a given candidate. So the story really falls apart
under closer inspection, in my view.
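The comparison described above (an overall correlation, then a separate look at larger precincts) can be sketched roughly as follows. This is a hedged illustration, not the code or statistical package I actually used: the function names and the `cutoff` parameter are hypothetical, and the inputs are assumed to be two parallel lists, total votes and Clinton percentage per precinct.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlation_by_size(total_votes, clinton_pct, cutoff=140):
    """Correlation overall, at or below the cutoff, and above the cutoff."""
    pairs = list(zip(total_votes, clinton_pct))
    small = [(v, p) for v, p in pairs if v <= cutoff]
    large = [(v, p) for v, p in pairs if v > cutoff]
    return {
        "all": pearson_r(total_votes, clinton_pct),
        "small": pearson_r(*zip(*small)),
        "large": pearson_r(*zip(*large)),
    }
```

If the "large" coefficient is near zero while the "small" one is positive, the size relationship is confined to the smaller precincts, which is the pattern described above. A significance test would still be needed before reading anything into either number.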
Nor is Table 3 suggestive, in my experience looking at precinct returns in many states. It is extremely
common for counties to show empty precincts in an election, for lots of reasons. They may expect
future growth (or the opposite), they may have manpower shortages that lead them to consolidate, etc.
The other stuff (absentee ballots vs. affidavit ballots) could be any number of things. The data you sent
me don't allow me to check them out, because they don't separate absentee from election day from
affidavit totals. My question would be, are the absentees and affidavits both counted in the precinct
total that passes through to the county/city result, or is only one of the categories in there? If the latter,
again that's a common way that raw data come through from a voting system: absentees and other non-ED votes sometimes get lumped together. The fact that the problem is particularly acute within one
geography would be consistent with that theory. That isn't fraud, or really, even evidence of poor
administration. It's just a glitch in the way raw data are pumped into the spreadsheet.
I would say that no conclusions should be drawn from the county-level analysis, or the type of thing
done in Tables 9-11. Since your samples aren't random, you're pretty exposed to claims of cherry-picking, and you are making some pretty crucial assumptions about voters in these precincts without
obvious sourcing. It would be really irresponsible to offer that section of the report as evidence of
anything, in my opinion.
I'm sure Walter can add something with a quick analysis of these data, but I don't see a smoking gun in
either the data or the report you sent, with the caveat that I am pretty limited with respect to what I can
do, given the way these data are categorized. If you could break out absentees vs. ED votes, I could say
more.
Michael G. Miller
Assistant Professor
Message body
Thank you, Michael.
A few quick responses to your questions tonight, more tomorrow.
The way race was controlled for is in the xlsx file, which has several sheets attached. Harlem for
instance (in Manhattan where the vote share remains constant across precinct size) is also even across
various precinct sizes, whereas analysis of Latino-heavy areas in the Bronx and Brooklyn shows that,
even within Latino areas, Clinton increases share as precinct size increases. All of that data is in the
.xlsx file.
We do also have the certified data, and can pass that along. The idea is that the election night data is
where, if electronically rigged, things would show up. The election night numbers do not show
affidavit or overseas ballots; those are added in weeks later and are included in the certified results. We
figured this could throw off test results, but will certainly add them in another column as desired.
Yes, we noticed that at 150 votes the increase leveled off (but we don't note a drop-off). We see no
reasonable explanation, especially within the Bronx, which has almost no white population. It's hard
to say what the figure is exactly, given a data entry error here (it says 10.2% for White alone, not Hispanic, but
the totals for all ethnicities are well over 100%, which isn't the case in any other county I've seen):
http://www.census.gov/quickfacts/table/SEX255214/36005
I agree on Tables 9-11 and have told the person we are working with that they couldn't really be used, for
reasons similar to those you state.
Doug
Stewart M. McCauley
There is no clear theoretical expectation for the slope of the LOESS. In a totally non-problematic
election, one might expect it to be horizontal over the entirety of the distribution. But:
1. (This point is my best poke at what you're asking for): Because there are lots of reasons for seeing
the pattern I got (higher votes in large minority precincts being most obvious), the pattern is not itself
evidence of fraud. There is a positive correlation in the lower range that disappears at 145 votes.
Certainly that is not suggestive of vote stuffing in large precincts as a rule. But the bottom line is that
just because things "look" weird to the human eye is not evidence of fraud. If I can reasonably point at
any other cause of a pattern, then absent other data/analysis it is irresponsible to conclude that the
election is fraudulent. I want to be very clear here that the plot I sent you should not be taken on its own
as anything close to evidence of vote manipulation.
2. The report makes definite allusion to a uniformly positive correlation, which I show is not present
overall in NYC precincts at large. I did not see an obvious way in the csv to do a separate plot by
borough, so I worked with what I had. I am happy to do it by borough if you can put a borough ID in
the csv, but absent reliable precinct race controls I'm not sure what a borough-level analysis is
conveying since to my knowledge delegates are not assigned at the borough level. If I have that wrong,
I can take a look.
3. The LOESS lines come from a local regression estimation. It looks to me like your guy is using a
line plot based on binned data. They are different (but similar) methods. Mine does not rely on any
binning and should in theory provide a more fluid estimation. But maybe I read it wrong.
4. I remain interested in looking at precinct vs. absentee breakdowns, especially in Queens, given the
report.
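To make the distinction in point 3 concrete, here is a rough sketch contrasting binned averages with a local regression. It is an assumption-laden illustration, not the code behind either plot: the function names are hypothetical, the local fit is a plain tricube-weighted linear regression over the nearest fraction of points (no robustness iterations), and a real analysis would normally use an established statistics package.

```python
import math


def binned_means(xs, ys, width):
    """Average y within fixed-width bins of x; keys are bin lower edges."""
    sums = {}
    for x, y in zip(xs, ys):
        edge = (x // width) * width
        total, count = sums.get(edge, (0.0, 0))
        sums[edge] = (total + y, count + 1)
    return {edge: total / count for edge, (total, count) in sorted(sums.items())}


def loess_fit(xs, ys, frac=0.5):
    """Locally weighted linear fit evaluated at each x (no binning)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    k = max(2, min(n, math.ceil(frac * n)))  # neighbours per local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h == 0:
            weights = [1.0 if x == x0 else 0.0 for x in xs]
        else:
            # Tricube weights: near points count most, far points not at all.
            weights = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        sw = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / sw
        mean_y = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted
```

The binned version depends on where the bin edges fall, so a different `width` can move the apparent leveling-off point. The local fit has no edges, although its smoothness still depends on the chosen `frac`.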
Stewart M. McCauley
Michael,
First, I am clear that your scatter graph does not mean you think fraud is indicated. I will not pull a fast
one on you. Unless there is something very clear (like Walter's since walked back "substantial fraud"
comment), I will not remotely suggest you are stating fraud.
The reason to go by borough was precisely, I believe, to address this question of whether race or
ethnicity could explain the precinct size difference. If you would like to analyze, or suggest we analyze,
based on CD, we would by all means take CD 15 as indicative. CD 15 is now entirely within The
Bronx, has plenty of different sized precincts, and is majority Latino (to the tune of 66.1% across the
board). We could do the analysis first using all precincts in CD 15, then eliminating the very few that
aren't majority Latino (according to NYT data), if you'd like. I am copying Nick Bauer into the
So we are going to send you three data sets between tonight and tomorrow:
1) Nick is adding in a column for borough (assigned by number, such as 1 = Kings, 2 = Staten Island)
to the document we sent you last night.
2) Stewart is working to add figures to the same document that will include absentee/affidavit ballots.
Those are not broken down by candidate. We just have a total for each in terms of number counted
plus the updated totals for the candidates. We should say that we aren't as confident as you appear to be
that there is not likely to be a difference between those percentages and election day percentages. There
have been pretty massive differences, in fact, in previous states for such figures.
3) I am going to provide an xlsx document with two sheets, one with all CD 15 precincts included; one
with CD 15 minus precincts that aren't majority Latino per the NYT data. Nick (though I haven't asked
him this yet) may do up a scatter or line graph for each of those.
Doug
CC
Stewart M. McCauley
Nicholas Bauer
OK, I will work with whatever you have. But if possible, send in csv form so I can read it into the
statistical package we use.
By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 24 at 9:11 PM
To
Michael Miller
CC
Stewart M. McCauley
Michael,
The Boroughs are now included, thanks to Nick, in the first .csv file with the Boroughs in the 4th
column, where Manhattan = 1, Bronx = 2, Brooklyn = 3, Queens = 4, Staten Island = 5. I've eliminated
the word headers for the columns as you did for Prof. Mebane.
The 2nd csv file includes all EDs in the Bronx minus the 58 precincts the NYT identifies as not
having a Latino majority.
Doug
Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 6:58 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Doug,
Walter and I use different packages. I need the column headers. I'm not going to work on the 2nd file
because I think an arbitrary bin like that is an insufficient means of control. The only race control that
would be useful would be universal, precinct-level data.
Re: By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 25 at 8:47 AM
To
Michael Miller
CC
Stewart M. McCauley
I am not terribly surprised that confirmed majority Latino precincts in a county confirmed by US
Census figures to be 54.8% Latino is not sufficient. If I do use those figures in an article, however, I'll
feel plenty of justification for saying there is no good explanation for the increase by precinct size (in
this case a 14% swing for precincts with 50 or fewer voters versus precincts with 200+ voters) . In fact,
we'll likely plot what all majority Latino precincts (over 1000, or ~20% city wide) look like versus the
relatively straight LOESS lines in Manhattan and Harlem. Readers and other stats-minded people can
decide. I'll be sure to note your clear objections if I do so.
Attached is the Borough document with headers.
Very much appreciate the work you are putting into this and the frank back and forth.
Doug
Message body
If you want to make the case why they are plausible with specific facts related to the NYC campaign or
what we reasonably know about Clinton v Sanders supporters, I'll be happy to discuss. I'd very much
like to hear why Clinton targeted voters everywhere in NYC but Manhattan or why, against evidence
sometimes in video form from multiple states, Clinton supporters are more likely to stand in long lines
(which by the way, there were virtually no reports of such in NYC). Again, if I report on this, I will
include *all* your objections, but they will not be taken without comparison to actual data, facts, etc.
That's simply how I report and function as someone who was on multiple national championship
winning switch side college debate teams. Yes, address all arguments. No, a weakly supported
argument doesn't go very far. Always the goal is to address the *very strongest* part of a competing
argument.
Doug
Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 11:40 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Walter Mebane
Doug,
If I can be frank, it's not my job to make a case. I don't have a dog in the fight, and I don't much care
what the outcome of an analysis is here. You asked me whether I thought there was fraud. And in my
professional opinion, there is no clear evidence of fraud in this election given the data I have seen
either in the precinct data you supplied or in your report. I believe in that conclusion even more in light
of Walter's analysis.
I do, however, now have serious questions about the objectivity of this reporting. It really seems to me
that you're going to report that the election is fraudulent no matter what we say. Doing this kind of
work takes time. So, since in my view my findings will not be fairly treated if they turn out to be null, I
will not be conducting any additional analysis here. I am also clarifying that any findings I have
reported to you should be deemed exploratory and not final. I do not authorize them for public release.
I do not give you my permission to use my name, institutional affiliation, or credentials in your report,
in any circumstance, even to contradict your claims. Good luck with the story.
Re: By Borough
CC
Michael Miller
Stewart M. McCauley
I have not determined how I will report on this. I am most sorry that you don't appreciate healthy give
and take on the details of particular theories. If you, Walter, or any other statistician makes a
compelling case, I will use that case as part of debunking certain fraud theories, just as I have here,
with reference to four particular states:
Hillary Clinton vs. Bernie Sanders: Debunking Some Election Fraud Allegations
I care about truth as much as I do about justice. If you have something to offer in terms of evidence,
argumentation, and supportable non-fraudulent theories ... which clearly you do ... I take those things
very seriously. My reporting bears that out and will continue to do so.
Doug
Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 11:58 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Do not contact me again. If my name is used in your story I will refer the issue to my university
counsel.
Re: By Borough
Stewart M. McCauley
xxxx@comcast.net
Fair enough. This will be my last email to you. And your name *will be* used in an article given
this threat (weak as it is). That's how fair but tough adversarial journalism works. I've cc'ed my
editor so he can see the thread.
Doug
Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 12:39 PM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
sitka@comcast.net
I like your style Doug. Let's dance.
[at this point, Prof. Miller began his 25 tweet description of the previous conversations]
Re: By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 25 at 1:10 PM
To
CC
Stewart M. McCauley
http://giphy.com/gifs/redoaks-red-oaks-xT1XGJVARw89Dr3ifu