
Benford's Law Test

Doug Johnson Hatlem <djjohnso@yahoo.com>


May 20 at 11:59 AM
To
wmebane@umich.edu
mgmiller@barnard.edu
Hello Drs. Mebane and Miller,

I am a reporter at CounterPunch currently writing on Election Fraud Allegations in the Democratic
Primary. Where I can, I am debunking certain claims; where I can't, I try to give the best possible
explanation.

Hillary Clinton vs. Bernie Sanders: Debunking Some Election Fraud Allegations

In 2010, the two of you used a Benford's Law Test, if I am not mistaken, to analyze results in a South
Carolina primary contest. The results were reported on here at FiveThirtyEight:
http://fivethirtyeight.com/features/sc-democratic-primary-getting-weirder/
I am wondering if you have or would be willing to run a similar statistical analysis for the New York
Democratic Presidential Primary results, especially as released on election night. Some of the results
(like basically perfect 60-40 and 70-30 splits in Kings County and the Bronx ... there are several others,
including a virtually perfect 58-42 split for the overall results) seem ripe for a Benford's Test, but I am
not an expert in stats and elections.
I've attached a document with the county level results from election night as published by Politico.
There have been some slight adjustments with the additions of absentee and affidavit ballots since these
results.
What I can promise is that I will publish on the results of your findings whichever way the test comes
out.

Thank you for your time!

Doug Johnson Hatlem

Re: Benford's Law Test


Michael Miller <mgmiller@barnard.edu>
May 20 at 12:54 PM
To
Walter Mebane
CC
Doug Johnson Hatlem
Hi Doug,
If you only have county data, a breakdown of absentee vs. ED voting could also possibly be
informative with respect to the possibility of machine glitches. But yeah, we'd need precinct data to say
anything more than that. It's worth noting though that "perfect splits" do occur all the time by random
chance, so they shouldn't be taken as evidence of a suspect outcome on their own.

Re: Benford's Law Test



Doug Johnson Hatlem <djjohnso@yahoo.com>


May 20 at 1:37 PM
To
Michael Miller
Walter Mebane

Thank you, Profs. Miller and Mebane, for your quick replies.

Could we start, then, with the New York City results? Those are available at the precinct level in pdf
and csv format here:
http://vote.nyc.ny.us/html/results/results.shtml
I have a message into the New York State Board of Elections to try to get a hold of the precinct level
results across the state.
Much appreciated,
Doug Johnson Hatlem

Re: Benford's Law Test


Michael Miller <mgmiller@barnard.edu>
May 20 at 1:58 PM
To
Walter Mebane
CC
Doug Johnson Hatlem

I'll echo Walter here. If we are supplied with a clean, uniformly coded file with all necessary
information included (as we were during the Rawls-Greene affair) we can possibly tell you something.
But I don't think either of us can take the time to clean data.

Re: Benford's Law Test


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 20 at 2:04 PM
To
Michael Miller
Walter Mebane

Ok. I'll have that done and get back to you. I have voter registration data. Let me see if it is by Election
District (precinct).
For clarity, what is the clean formatting you would like?

Doug

Re: Benford's Law Test


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 20 at 2:22 PM
To
Walter Mebane
Michael Miller
CC
smm424@cornell.edu

We have a database of all NYS voters that we obtained by FOIL. Checking to see if it includes a
breakdown by election district. I doubt it. Might have to go back to NYC BOE for that.
I am looping Stewart McCauley in here as he is going to clean up the dataset (or co-ordinate doing so).
Knowing the precise specifications for the data set that would make your job easiest would be best.
Doug Johnson Hatlem

NYC Election Data/


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 23 at 8:32 PM
To
Walter Mebane
Michael Miller
CC

Stewart M. McCauley

Profs. Mebane and Miller,
Attached you will find a .csv file with the information, we believe, in workable format (organized first
by Assembly District + Election District, with the Congressional District each falls into in the far left
column). This includes all registered voters by ED (including GOP), with further columns breaking
Dems out as Active, Inactive, and Purged (only purged voters may not vote). PreReg voters (columns I
and M) are voters who registered as 17-year-olds and will turn 18 before November.

This data is from election night rather than current as updated/certified.


I am attaching two other documents from someone we are working with who compared votes by
precinct size across NYC and discovered that, other than in Manhattan, Clinton's vote share rose and
Sanders's sank as precinct size grew. He ran some tests as well to control for race/ethnicity.
We'd love your opinion on these. The first document (xlsx) is the raw data with some associated graphs.
The second is a PDF with a narrative description of the work along with some of the graphs.
Thank you so much for doing this. Our current working theory (which we have some evidence for
outside the data) is that vote totals may have been changed at the central point for data entry by
county/borough, perhaps with a specific percentage target for the particular county in mind. (Might this
avoid detection with Benford's Law Test?) We have spoken, for instance, with a precinct worker who
recorded the total from the ticker tape at the end of the night for his AD/ED. It doesn't match the total
announced on election night (and in this data).
Again, my commitment is to publish including your views of what the data tells us, if anything, even if
they are not amenable to a theory of fraud.
Best,
Doug Johnson Hatlem
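
For readers unfamiliar with the test being requested, a second-digit Benford check on precinct vote counts can be sketched roughly as follows. This is only an illustrative sketch, not the professors' actual procedure: the function names and the example counts are hypothetical, and a plain Pearson chi-square is assumed as the test statistic.

```python
import math
from collections import Counter


def second_digit_probs():
    """Benford's expected frequencies for the second significant digit (0-9)."""
    return [
        sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    ]


def second_digit_chi2(counts):
    """Pearson chi-square of observed second digits against Benford's law.

    Counts below 10 have no second digit and are skipped.
    Returns (statistic, number of counts used).
    """
    digits = [int(str(c)[1]) for c in counts if c >= 10]
    n = len(digits)
    if n == 0:
        raise ValueError("no counts with two or more digits")
    observed = Counter(digits)
    expected = second_digit_probs()
    stat = sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in enumerate(expected)
    )
    return stat, n


# Hypothetical precinct vote counts, for illustration only.
example_counts = [112, 87, 230, 145, 96, 310, 58, 174, 221, 133]
stat, n = second_digit_chi2(example_counts)
CRITICAL_5PCT_DF9 = 16.919  # chi-square critical value, 9 degrees of freedom
print(f"chi2 = {stat:.2f} on {n} counts; above 5% threshold: {stat > CRITICAL_5PCT_DF9}")
```

Ten counts are far too few for the chi-square approximation to be trustworthy; a real analysis would use thousands of precinct totals, and a large statistic would still call for the kind of benign explanations discussed later in the thread.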

Re: NYC Election Data/


Michael Miller <mgmiller@barnard.edu>
May 23 at 10:13 PM
To
Doug Johnson Hatlem
CC
Walter Mebane
Stewart M. McCauley

Doug,
I'll focus my response on your pdf document. I have looked at these data using a VERY quick Benford
test (i.e., I would want some more time to carefully go through it before I said anything about it), but
that's really Walter's department so I won't have much to say about it until he runs it through his 2BL
test (which, to be clear, I did not do). If things went as your theory suggests, yes, I'd say that would
trigger Walter's test most of the time, but I'll let him cover that if he wants.
Thoughts on your report:

First question: Why are we not using certified data? Those are the votes that count. Lots of the things
you are mentioning in the report (see below) could be corrected in the certified data, if they are what I
think they are.
Table 2: It's not clear to me why larger precincts being better for Clinton should be an indication of
fraud/error. I calculate the overall correlation between total votes and Clinton percentage as .205.
That's a moderate correlation, but there are lots of plausible explanations. Knowing what we know
about Clinton's base of support, it could simply be that large precincts are located in areas with higher
minority populations (which would be consistent with practice in many states, and which you reference
on page 4). I couldn't say anything about it without knowing where these precincts are, and what the
racial makeup of them is. There is some allusion to controlling for race in your email, but there is no
race data in this file so I'm not sure how that was done.
More important, the relationship does not appear to be uniformly linear. So I drilled down a bit. I
attached a scatter plot of total votes against Clinton's vote percentage. You can see pretty clearly that
the relationship is actually positive only through total votes of about 150, and then it levels off and
disappears altogether. Indeed, the correlation coefficient is actually small and negative (and not
statistically significant) in precincts with more than 140 votes cast, indicating that in those precincts
there is no relationship between size and success of a given candidate. So the story really falls apart
under closer inspection, in my view.
Nor is Table 3 suggestive, in my experience looking at precinct returns in many states. It is extremely
common for counties to show empty precincts in an election, for lots of reasons. They may expect
future growth (or the opposite), they may have manpower shortages that lead them to consolidate, etc.
The other stuff (absentee ballots vs. affidavit ballots) could be any number of things. The data you sent
me don't allow me to check them out, because they don't separate absentee from election day from
affidavit totals. My question would be, are the absentees and affidavits both counted in the precinct
total that passes through to the county/city result, or is only one of the categories in there? If the latter,
again that's a common way that raw data come through from a voting system: absentees and other
non-ED votes sometimes get lumped together. The fact that the problem is particularly acute within one
geography would be consistent with that theory. That isn't fraud, or really, even evidence of poor
administration. It's just a glitch in the way raw data are pumped into the spreadsheet.
I would say that no conclusions should be drawn from the county-level analysis, or the type of thing
done in Tables 9-11. Since your samples aren't random, you're pretty exposed to claims of
cherry-picking, and you are making some pretty crucial assumptions about voters in these precincts without
obvious sourcing. It would be really irresponsible to offer that section of the report as evidence of
anything, in my opinion.
I'm sure Walter can add something with a quick analysis of these data, but I don't see a smoking gun in
either the data or the report you sent, with the caveat that I am pretty limited with respect to what I can
do, given the way these data are categorized. If you could break out absentees vs. ED votes, I could say
more.
___________________________________
Michael G. Miller
Assistant Professor
Department of Political Science
Barnard College, Columbia University
Phone: 212-854-6181
Personal Website: http://www.michaelgmiller.com/
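
The comparison described above, an overall correlation between precinct size and candidate share set against the correlation among larger precincts only, can be sketched along these lines. The rows below are made-up numbers chosen to show a rise that levels off; they are not the NYC data, and `pearson_r` and `r_above` are hypothetical helper names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def r_above(rows, min_votes):
    """Correlation restricted to precincts with more than min_votes total votes."""
    kept = [(v, s) for v, s in rows if v > min_votes]
    return pearson_r([v for v, _ in kept], [s for _, s in kept])


# Hypothetical (total_votes, candidate_share_pct) pairs, for illustration only.
rows = [(20, 40.0), (50, 48.0), (90, 55.0), (130, 61.0),
        (160, 63.0), (200, 62.0), (260, 64.0), (320, 62.5)]

r_all = pearson_r([v for v, _ in rows], [s for _, s in rows])
r_big = r_above(rows, 140)
print(f"all precincts: r = {r_all:.3f}; more than 140 votes: r = {r_big:.3f}")
```

A single pooled coefficient can hide the fact that the relationship exists only in part of the range, which is why splitting at a threshold (or plotting a smoother) is informative.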

Re: NYC Election Data/



Doug Johnson Hatlem <djjohnso@yahoo.com>


May 23 at 10:29 PM
To
Michael Miller
CC
Walter Mebane
Stewart M. McCauley

Thank you, Michael.
A few quick responses to your questions tonight, more tomorrow.
The way race was controlled for is in the xlsx file, which has several sheets attached. Harlem, for
instance (in Manhattan, where the vote share remains constant across precinct size), is also even across
various precinct sizes, whereas analysis of Latino-heavy areas in the Bronx and Brooklyn shows that,
even within Latino areas, Clinton's share increases as precinct size increases. All of that data is in the
.xlsx file.
We do also have the certified data, and can pass that along. The idea is that the election night data is
where, if electronically rigged, things would show up. The election night numbers do not show
affidavit or overseas ballots; those are added in weeks later and are included in the certified results. We
figured this could throw off test results, but will certainly add them in another column as desired.
Yes, we noticed that at 150 votes the increase leveled off (but don't note a drop-off). We see no
reasonable explanation, especially within the Bronx, which has almost no white population. (It's hard
to say exactly what the figure is given a data entry error here: it says 10.2% for White alone, not
Hispanic, but the totals for all ethnicities come to well over 100%, which isn't the case in any other
county I've seen.)
http://www.census.gov/quickfacts/table/SEX255214/36005
I agree on Tables 9-11 and have told the person we are working with that they couldn't really be used,
for reasons similar to those you state.

Doug

Re: NYC Election Data/


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 23 at 10:35 PM
To
Michael Miller
CC
Walter Mebane
Stewart M. McCauley
So, just to be clear. The data in the columns is only the election day count. There are no votes included
that did not go through the DS200 electronic voting machines which spit out a ticker tape report at the
end of the night. No absentee or affidavit balloting data are included here at all. (But we can get that to
you as desired.)
Doug

Re: NYC Election Data/



Michael Miller <mgmiller@barnard.edu>


May 23 at 10:45 PM
To
Doug Johnson Hatlem
CC
Walter Mebane
Stewart M. McCauley
Doug,
Sorry, I still don't see race data, but the way you're describing it won't really be useful unless it's
captured at the precinct level. That could possibly be done with GIS, and some states keep this
information in their redistricting files, but I don't know about New York. The non-electronic votes are
useful because presumably people in a given precinct who vote absentee don't deviate in preference
much from those who vote on Election Day. So, comparing votes between the two groups can be
informative.
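
The election-day versus absentee comparison suggested above could be sketched for a single precinct as below. This assumes candidate totals are available separately for each voting method, which the thread has not confirmed; the function name is hypothetical, and a pooled two-proportion z-statistic is one simple choice among several.

```python
import math


def share_gap_z(ed_for, ed_total, abs_for, abs_total):
    """Compare a candidate's share among election-day and absentee voters.

    Returns (gap, z), where gap is the election-day share minus the absentee
    share and z is the pooled two-proportion z-statistic.
    """
    if ed_total <= 0 or abs_total <= 0:
        raise ValueError("both groups need at least one vote")
    p_ed = ed_for / ed_total
    p_abs = abs_for / abs_total
    pooled = (ed_for + abs_for) / (ed_total + abs_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ed_total + 1 / abs_total))
    if se == 0:
        raise ValueError("z undefined when the pooled share is 0 or 1")
    gap = p_ed - p_abs
    return gap, gap / se


# Hypothetical precinct: 70 of 100 election-day votes, 50 of 100 absentee votes.
gap, z = share_gap_z(70, 100, 50, 100)
print(f"gap = {gap:.2f}, z = {z:.2f}")
```

A large gap in one precinct is not evidence of anything on its own, since absentee and election-day voters can legitimately differ; the interest lies in whether gaps are systematically larger in some geographies than in others.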

Re: NYC Election Data/



Doug Johnson Hatlem <djjohnso@yahoo.com>


May 24 at 12:42 PM
To
Michael Miller
CC
Stewart M. McCauley
Looking at your later note along with your scatter test graph from last night with new eyes for a new
day ... a couple of questions.
I am trying to wrap my head around your scatter graph. Should the LOESS line be basically straight or
no, in your analysis? Just need to be clear if I am to explain to my readers eventually. The graphs by
Nick by borough and for Harlem versus 50% Latino precincts in The Bronx seem much more
straightforward. I am not sure I understand your criticism of them.
If it is simply a matter of how race/ethnicity is controlled for, Nick especially focused on AD 84, along
with several other precincts that are 50% or more Latino as reported by the New York Times here:
http://www.nytimes.com/interactive/2016/04/19/us/elections/new-york-city-democratic-primary-results.html?_r=0#11/40.8302/-73.8885
As you can see, nearly all of AD 84 (The Bronx, 55% Latino per US census) fits in that model and the
Times map includes precinct level data as to which are greater than 50% Latino. The LOESS lines are
basically straight in Harlem (all 50%+ black), which is in Manhattan where we suspect no
manipulation. In the Bronx, Latino areas instead show the pattern of greater share as precinct size
increases, with definitively non-straight lines. The specific sheet I am referring to here is Bronx AD 84.
If you can give a layman's explanation for why this data does not show potential for fraud, I will be
happy to pass it to my readers.
Doug

Re: NYC Election Data/


Michael Miller <mgmiller@barnard.edu>
May 24 at 4:07 PM
To
Doug Johnson Hatlem
CC

Stewart M. McCauley
There is no clear theoretical expectation for the slope of the LOESS. In a totally non-problematic
election, one might expect it to be horizontal over the entirety of the distribution. But:
1. (This point is my best poke at what you're asking for): Because there are lots of reasons for seeing
the pattern I got (higher votes in large minority precincts being most obvious), the pattern is not itself
evidence of fraud. There is a positive correlation in the lower range that disappears at 145 votes.
Certainly that is not suggestive of vote stuffing in large precincts as a rule. But the bottom line is that
just because things "look" weird to the human eye is not evidence of fraud. If I can reasonably point at
any other cause of a pattern, then absent other data/analysis it is irresponsible to conclude that the
election is fraudulent. I want to be very clear here that the plot I sent you should not be taken on its own
as anything close to evidence of vote manipulation.
2. The report makes definite allusion to a uniformly positive correlation, which I show is not present
overall in NYC precincts at large. I did not see an obvious way in the csv to do a separate plot by
borough, so I worked with what I had. I am happy to do it by borough if you can put a borough ID in
the csv, but absent reliable precinct race controls I'm not sure what a borough-level analysis is
conveying since to my knowledge delegates are not assigned at the borough level. If I have that wrong,
I can take a look.
3. The LOESS lines come from a local regression estimation. It looks to me like your guy is using a
line plot based on binned data. They are different (but similar) methods. Mine does not rely on any
binning and should in theory provide a more fluid estimation. But maybe I read it wrong.
4. I remain interested in looking at precinct vs. absentee breakdowns, especially in Queens, given the
report.
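
The distinction drawn in point 3, a line through binned averages versus a local regression, can be illustrated with a minimal sketch. The local fit below is a simplified tricube-weighted straight-line fit, not a full LOESS implementation (no robustness iterations) and not necessarily what either party's software computes; function names, bin width, and span are assumptions.

```python
import math


def binned_means(xs, ys, width):
    """Average y within fixed-width bins of x; returns {bin_start: mean_y}."""
    sums = {}
    for x, y in zip(xs, ys):
        start = (x // width) * width
        total, count = sums.get(start, (0.0, 0))
        sums[start] = (total + y, count + 1)
    return {start: total / count for start, (total, count) in sorted(sums.items())}


def local_linear(xs, ys, x0, frac=0.5):
    """Tricube-weighted straight-line fit evaluated at x0.

    Uses the nearest `frac` share of the points as the local neighborhood.
    """
    n = len(xs)
    k = min(n, max(2, math.ceil(frac * n)))
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    if h == 0:
        # Every neighbor sits exactly at x0: fall back to their plain mean.
        at_x0 = [y for x, y in zip(xs, ys) if x == x0]
        return sum(at_x0) / len(at_x0)
    h *= 1.001  # widen slightly so the k-th neighbor keeps a small weight
    ws = []
    for x in xs:
        u = abs(x - x0) / h
        ws.append((1 - u ** 3) ** 3 if u < 1 else 0.0)
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        return my  # all weighted points share one x value
    slope = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / sxx
    return my + slope * (x0 - mx)


# Hypothetical (total_votes, share_pct) data, for illustration only.
votes = [20, 50, 90, 130, 160, 200, 260, 320]
share = [40.0, 48.0, 55.0, 61.0, 63.0, 62.0, 64.0, 62.5]
print(binned_means(votes, share, 100))
print([round(local_linear(votes, share, v), 1) for v in (50, 150, 250)])
```

The binned line depends on where the bin edges fall, while the local fit moves its window continuously, so the two can disagree about where a trend flattens even on the same data.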

Re: NYC Election Data/


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 24 at 5:02 PM
To
Michael Miller
CC

Stewart M. McCauley

Michael,

First, I am clear that your scatter graph does not mean you think fraud is indicated. I will not pull a fast
one on you. Unless there is something very clear (like Walter's since-walked-back "substantial fraud"
comment), I will not remotely suggest you are stating fraud.
The reason to go by borough was precisely, I believe, to address this question of whether race or
ethnicity could explain the precinct size difference. If you would like to analyze, or suggest we analyze,
based on CD, we would by all means take CD 15 as indicative. CD 15 is now entirely within The
Bronx, has plenty of different sized precincts, and is majority Latino (to the tune of 66.1% across the
board). We could do the analysis first using all precincts in CD 15, then eliminating the very few that
aren't majority Latino (according to NYT data), if you'd like. I am copying Nick Bauer into the
conversation now since he is doing those analyses.


We will get back to you shortly on the feasibility of the absentee ballots for Queens. Would you prefer
overseas, other absentee, and affidavit ballots separated out in various columns, or all together in one
column, either by themselves or in combination with election day results?
Doug

Re: NYC Election Data/



Michael Miller <mgmiller@barnard.edu>


May 24 at 5:32 PM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Doug, Good. The more detailed you can make it, the better. I can easily tabulate them later. I think the
trends you're interested in should be analyzed within the geographies from which delegates are
assigned. The reason is that if a fraud is perpetrated, it will be to get delegates. I don't think delegates
are assigned by county in NY, right? So CD would be right. But again, I wouldn't find an analysis
convincing without at minimum precinct-level race controls. All that said, for the sake of clarity, I am
looking for the absentees at the precinct level with a CD identifier. That analysis would be, I think, the
last useful piece of information I can provide you given Walter's findings. Not saying I'm not willing,
just saying I will reach the limits of the data at that point I think.

Re: NYC Election Data/


Doug Johnson Hatlem <djjohnso@yahoo.com>
May 24 at 6:31 PM
To
Michael Miller
CC
Stewart M. McCauley
Nicholas Bauer
Michael:
The reason for doing it by boroughs is partly because we believe there are varying levels of
corruptibility by borough/county within NYC (and this isn't all hunch either).

So we are going to send you three data sets between tonight and tomorrow:

1) Nick is adding in a column for borough (assigned by number, such as 1 = Kings, 2 = Staten Island)
to the document we sent you last night.
2) Stewart is working to add figures to the same document that will include absentee/affidavit ballots.
Those are not broken down by candidate. We just have a total for each in terms of number counted
plus the updated totals for the candidates. We should say that we aren't as confident as you appear to be
that there is not likely to be a difference between those percentages and election day percentages. There
have been pretty massive differences, in fact, in previous states for such figures.
3) I am going to provide an xlsx document with two sheets, one with all CD 15 precincts included; one
with CD 15 minus precincts that aren't majority Latino per the NYT data. Nick (though I haven't asked
him this yet) may do up a scatter or line graph for each of those.
Doug

Re: NYC Election Data/


Michael Miller <mgmiller@barnard.edu>
May 24 at 6:36 PM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Nicholas Bauer

OK, I will work with whatever you have. But if possible, send in csv form so I can read it into the
statistical package we use.

By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 24 at 9:11 PM
To
Michael Miller
CC
Stewart M. McCauley
Michael,

The Boroughs are now included, thanks to Nick, in the first .csv file, with the Boroughs in the 4th
column, where Manhattan = 1, Bronx = 2, Brooklyn = 3, Queens = 4, Staten Island = 5. I've eliminated
the word headers for the columns as you did for Prof. Mebane.
The 2nd csv file includes all EDs in the Bronx minus the 58 precincts the NYT identifies as not
having a Latino majority.
Doug

Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 6:58 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Doug,
Walter and I use different packages. I need the column headers. I'm not going to work on the 2nd file
because I think an arbitrary bin like that is an insufficient means of control. The only race control that
would be useful would be universal, precinct-level data.

Re: By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 25 at 8:47 AM
To
Michael Miller
CC
Stewart M. McCauley
I am not terribly surprised that confirmed majority-Latino precincts in a county confirmed by US
Census figures to be 54.8% Latino are not sufficient. If I do use those figures in an article, however, I'll
feel plenty of justification for saying there is no good explanation for the increase by precinct size (in
this case a 14% swing for precincts with 50 or fewer voters versus precincts with 200+ voters). In fact,
we'll likely plot what all majority-Latino precincts (over 1000, or ~20% citywide) look like versus the
relatively straight LOESS lines in Manhattan and Harlem. Readers and other stats-minded people can
decide. I'll be sure to note your clear objections if I do so.
Attached is the Borough document with headers.
Very much appreciate the work you are putting into this and the frank back and forth.
Doug

On Wed, 5/25/16, Michael Miller <mgmiller@barnard.edu> wrote:

Subject: Re: By Borough
To: "Doug Johnson Hatlem" <djjohnso@yahoo.com>
Date: Wednesday, May 25, 2016, 7:17 AM
Doug,
Of course there are possible explanations. One is that one campaign focused mobilization efforts in
those precincts. Another is that one campaign's supporters were more willing to bear long lines in
crowded precincts. These are the "benign explanations" that Walter refers to in his emails. As long as
such things are plausible, fraud is not the most likely explanation.

Doug Johnson Hatlem <djjohnso@yahoo.com>
May 25 at 11:21 AM
To
Doug Johnson Hatlem
Michael Miller

If you want to make the case why they are plausible with specific facts related to the NYC campaign or
what we reasonably know about Clinton v Sanders supporters, I'll be happy to discuss. I'd very much
like to hear why Clinton targeted voters everywhere in NYC but Manhattan or why, against evidence
sometimes in video form from multiple states, Clinton supporters are more likely to stand in long lines
(of which, by the way, there were virtually no reports in NYC). Again, if I report on this, I will
include *all* your objections, but they will not be taken without comparison to actual data, facts, etc.
That's simply how I report and function as someone who was on multiple national championship
winning switch side college debate teams. Yes, address all arguments. No, a weakly supported
argument doesn't go very far. Always the goal is to address the *very strongest* part of a competing
argument.
Doug

Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 11:40 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Walter Mebane
Doug,
If I can be frank, it's not my job to make a case. I don't have a dog in the fight, and I don't much care
what the outcome of an analysis is here. You asked me whether I thought there was fraud. And in my
professional opinion, there is no clear evidence of fraud in this election given the data I have seen
either in the precinct data you supplied or in your report. I believe in that conclusion even more in light
of Walter's analysis.
I do, however, now have serious questions about the objectivity of this reporting. It really seems to me
that you're going to report that the election is fraudulent no matter what we say. Doing this kind of
work takes time. So, since in my view my findings will not be fairly treated if they turn out to be null, I
will not be conducting any additional analysis here. I am also clarifying that any findings I have
reported to you should be deemed exploratory and not final. I do not authorize them for public release.
I do not give you my permission to use my name, institutional affiliation, or credentials in your report,
in any circumstance, even to contradict your claims. Good luck with the story.

Re: By Borough

Doug Johnson Hatlem <djjohnso@yahoo.com>


May 25 at 11:48 AM
To
Michael Miller
CC
Stewart M. McCauley

I do not agree to retroactively taking things off the record.

I have not determined how I will report on this. I am most sorry that you don't appreciate healthy give
and take on the details of particular theories. If you, Walter, or any other statistician makes a
compelling case, I will use that case as part of debunking certain fraud theories, just as I have here,
with reference to four particular states:
Hillary Clinton vs. Bernie Sanders: Debunking Some Election Fraud Allegations
I care about truth as much as I do about justice. If you have something to offer in terms of evidence,
argumentation, and supportable non-fraudulent theories ... which clearly you do ... I take those things
very seriously. My reporting bears that out and will continue to do so.
Doug

Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 11:58 AM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
Do not contact me again. If my name is used in your story I will refer the issue to my university
counsel.

Re: By Borough

Doug Johnson Hatlem <djjohnso@yahoo.com>


May 25 at 12:24 PM
To
Michael Miller
CC

Stewart M. McCauley
xxxx@comcast.net

Fair enough. This will be my last email to you. And your name *will be* used in an article given
this threat (weak as it is). That's how fair but tough adversarial journalism works. I've cc'ed my
editor so he can see the thread.

Doug

Re: By Borough
Michael Miller <mgmiller@barnard.edu>
May 25 at 12:39 PM
To
Doug Johnson Hatlem
CC
Stewart M. McCauley
sitka@comcast.net
I like your style Doug. Let's dance.
[At this point, Prof. Miller began his 25-tweet description of the previous conversations.]

Re: By Borough
Doug Johnson Hatlem <djjohnso@yahoo.com>
May 25 at 1:10 PM
To
Doug Johnson Hatlem
Michael Miller
CC
Stewart M. McCauley
http://giphy.com/gifs/redoaks-red-oaks-xT1XGJVARw89Dr3ifu
