
Coverpage Assignment

Name of the Tutor: Dorothea Berglar
Course code: COR 1005
Course Title: Modeling Nature
Course Coordinator: Lonneke Bevers
Tutorial Group Number: 05
Student Name: Julian Koch
Student ID nr.: 6029285
Academic Year: 2
Number of Words: A provocative 3000 and 1 words (excl. footnotes & references)
Title Assignment: Calculating Football Results
Pigeon Hole nr.: 104
Date: ______________
Signature: ______________

With great thanks to: Dennis Klpping!

Calculating the probability for a football team to win in their next game

Introduction

Football, or as Americans know it, soccer, is probably the most popular sport on the planet. Football teams can be worth the GDP of small countries (e.g. that of Samoa) and the official market is worth billions (International Monetary Fund, April, 2010 & Gage, Jack, Aug. 2009). An additional portion of the market is the largely illegal betting industry, which in Germany alone is estimated to be worth about 5 billion Euros (Reuters, June 22, 2011). Besides the monetary value, there is the ideational value that the world-wide fan community attributes to the various teams they support. Sometimes one game makes the difference between a successful title-winning team and one that is the everlasting runner-up. All these interest groups (trainers, team managers, fans, gamblers and gambling bureaus) have their stakes in whether their team wins or not. For this purpose this paper seeks to develop and introduce an equation with which to calculate the result of the next game against another particular team (in club football). The scope of the equation is limited to the next game only, because there is little open literature and previous research on this approach, and the little that exists solely focuses on maximizing betting benefits (Soccerwidow, March 31, 2011).1

International football associations use different methods and equations to calculate team strengths. All of these, though, are descriptive and not prescriptive (they describe reality as they perceive it to be rather than making predictions) and are usually valued by the respective associations for their relative simplicity, which also implies transparency for the international community. This relative simplicity makes the calculation prone to certain quirks. For example, the FIFA World Ranking of national teams usually changes very abruptly 12 months after a world cup or continental cup, since after this period the weight of the matches during the cup is immediately halved (FIFA (n.d.)). With world cups and continental cups weighing especially heavily, this can lead to quite some turmoil in the ranking, depending on the performances of the cup-winning teams and runners-up during the last 12 months. Thus a team can be at the top of the ranking 11 months after winning the world cup and lose its top-ranking position 12 months after the cup without having played a single game during the intermediary month that could have negatively contributed to its change in rank. The equation developed in this paper, as we will see, tries to address this problem amongst others (though in a slightly different context) by decreasing the weight of games exponentially as time passes and not suddenly, step-like.

1 An additional reason for limiting this equation to calculating the next game is that a larger scope of the formula comes at the cost of prediction reliability, for which it would also be meaningful to give margins of error, whose calculation method, statistical means and data cannot be provided within the restrictions of such a paper.

Variables

Despite some of the apparent problems of the FIFA World Ranking,2 the basic variables it uses are very similar to the ones used in the equation in this paper. Incorporated in its calculations are the match results, the match status (a friendly match won't count as much as a match in a tournament), the opposition strength according to its world ranking at the time, the region weight of the two teams,3 and the period of assessment.4 Match status and regional weight won't play a role in the equation of this paper, since all league games contribute equally towards the final result, thus

being equally important, and since we are only interested in the opponent's strength relative to the strength of the team we want to calculate the probability to win for. Nonetheless, all the other variables (match result, strength of opponent and the period of assessment) play an important but differently weighted role.

As the formula developed in this paper is a prescriptive one and also aims to smooth out some unevenness of other calculation methods, other variables will have to be added in order to make reliable predictions. Of these, location is probably the most significant one: it plays a huge role whether a team plays at home, where it gets the support of its own fans, or away, where the other team gets the support of its fans. The concrete value of this variable can be established by looking at how the team has performed over the years when playing at home compared to playing away. A big enough sample size should adjust for possible distortions by other variables, such as exhaustion or opponent strength (as it should, inversely, for exhaustion and other variables as well).

Exhaustion also plays a role in how a team performs: multiple games in a week give the team little time to deal with the results psychologically, recover physically and improve possible faults through training. The concrete values of the variable exhaustion could be established "objectively", by setting a measure that playing more than e.g. one game per week leads to exhaustion and looking at empirical data of how the teams concerned have previously performed in such exhausting weeks. Another way to establish exhaustion could be to ask the players how fit they feel on the night of the game, scale the answers according to a scheme and average them.

2 This also goes for most other rankings; they usually work with the same principles.
3 Regions are the continents from which the two playing teams come. These are weighted differently according to another formula, which is of no further concern here (Edgar, March 10, 2010).
4 In the FIFA World Ranking it spans the last four years (FIFA (n.d.)).

A very insignificant variable in terms of concrete value, but quite an important one concept-wise, as it allows for variation by chance, is underestimation. If one looks at the seasons that big football teams such as Real Madrid or Barcelona play, then one will notice that they commonly lose or play a tie only against other "big" teams, or against very weak teams, which they would win against if they had played with more concentration. It can be assessed in a similar manner as the above variables, and different values for underestimation can be assigned corresponding to the opponent's strength.5 6

The last additional variable is how the particular team has played against the upcoming opponent in the last four games. Some teams almost have a tradition of beating big teams, despite themselves being merely mediocre. These football-related peculiarities could have a big impact and should not be ignored. Only the last four games are chosen, because squads, especially those of top

teams, tend to change a whole lot throughout two seasons, and thus including results from three or four seasons ago would not make much sense.

Equation

Given these basic variables and changes to the established calculation methods, the formula can now be developed and looks like this:

[equation not preserved in the source]

With $E_1$ being exhaustion of team 1, and $E_2$ being exhaustion of team 2. The same goes for location (home or away) L, and underestimation I. Their restricted domain is:

$0.5 \le E, L, I \le 1$

If I is 1 then chances of underestimation are 0. If I is 0.99 then chances of underestimation are at 1%. Experience tells that I should be very high and therefore the risk of underestimation very small. I should be proportionally related to the strength of the opponent, since the risk of underestimating an opponent should become smaller the higher its strength. Exhaustion means how many percent of its current form the team has available. Thus $E = 0.8$ means that your team is tired and can only access 80% of its current form. L should have a value of 1 for a home team, and a value less than 1 for an away team, since playing away lessens the probability for a team to win. All these variables, which can theoretically take quite small values (0.5 each), should generally be close to 1. It is unlikely that exhaustion, location and underestimation actually account for more than 50% of the actual form and strength of the teams.

5 Roughly, underestimation should be anti-proportionally related to the opponent's strength: when its strength rises, then underestimation should fall and vice versa.
6 It should be noted that all these variables can be seen as being constituted by sub-formulas and sub-variables (for exhaustion it could e.g. look like this: [formula not preserved in the source]), but because of the limitations of the paper, the above variables of underestimation, exhaustion and location are regarded as externally given.
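The restricted domain of E, L and I can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the class name `TeamCondition` and the method `scaled_form` are hypothetical, and the plain product of the three multipliers is an assumption, since the main equation itself is not preserved here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamCondition:
    """Externally given multipliers for one team, each restricted to [0.5, 1]."""
    exhaustion: float       # E: share of its current form the team can access
    location: float         # L: 1 for the home team, below 1 for the away team
    underestimation: float  # I: 1 means no risk of underestimating the opponent

    def __post_init__(self):
        # Reject values outside the restricted domain described in the text.
        for name in ("exhaustion", "location", "underestimation"):
            value = getattr(self, name)
            if not 0.5 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0.5, 1], got {value}")

    def scaled_form(self, form: float) -> float:
        # Assumption: the three multipliers enter as a plain product.
        return self.exhaustion * self.location * self.underestimation * form


# A tired away team (E = 0.8) with a 1% chance of underestimating its opponent.
tired_away_team = TeamCondition(exhaustion=0.8, location=0.85, underestimation=0.99)
print(round(tired_away_team.scaled_form(1.0), 4))
```

Validating the domain at construction time keeps values such as E = 0.3, which the text rules out, from silently entering a later calculation.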

S signifies how the team of interest has played against the upcoming opponent in the last four games. It is the sum of the match results divided by the number of matches (4). If the team of interest has won all games against the other team, then for this respective team the value will be 1, if it has had ties throughout the value will be 0.5, and if it has lost all games then the value is 0. The formula for calculating S is:

$S = \frac{x_1 + x_2 + x_3 + x_4}{4}$

And the restricted domain is:

$0 \le S \le 1$ 7

The index of x signifies the time level of the last games against the upcoming opponent. Thus $x_1$ is the previous game against this particular opponent, and $x_2$ is the game before the last game against this particular opponent.
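The head-to-head value S described above is a plain average of four coded results. A minimal Python sketch follows; the function name `head_to_head_strength` is hypothetical and not taken from the paper.

```python
def head_to_head_strength(results):
    """Average of the last four results against the upcoming opponent.

    results: most recent game first; each entry is 1 (win), 0.5 (tie) or 0 (loss).
    """
    allowed = {0, 0.5, 1}
    if len(results) != 4:
        raise ValueError("exactly four results are expected")
    if any(r not in allowed for r in results):
        raise ValueError("results must be 1, 0.5 or 0")
    return sum(results) / 4


print(head_to_head_strength([1, 1, 1, 1]))          # 1.0
print(head_to_head_strength([0.5, 0.5, 0.5, 0.5]))  # 0.5
print(head_to_head_strength([0, 0, 0, 0]))          # 0.0
```

The three calls reproduce the boundary cases named in the text: four wins, four ties and four losses.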

The most crucial to understanding this equation are the factors "a" and "b". Each of these factors represents the form a team is in, with a being the form of the team of interest, and b being the form of the upcoming opponent. They are constituted by the sub-equations:

[sub-equations for a and b not preserved in the source]

y, z (as well as the x in the equation of S) are the variables for the game results, which can only attain values of 1, 0.5 and 0. Hence:

$y, z \in \{0, 0.5, 1\}$ 8

Both a (the team of our interest) and b (the upcoming opponent of our team of interest) add up all the game results of the previous 15 games, running from the first game (j=1) to the 15th (n=15). A rational team, though, would not mind it too much to lose against another team that seems to be beyond its strength. On the other hand, if that team then wins against the much stronger opponent, this is a special success! The same in reverse applies if the team of interest is very strong and has performed particularly well against the opponent. Another win against such an opponent would thus simply be business as usual and should not count very much. To achieve such a weighting of the game results according to the strength of the opponent, y or z (depending on whether the team of interest or its upcoming opponent is concerned) is multiplied with $(1 - S_i)$ and $(1 - S_j)$ respectively (the indexes i and j will be explained below). S is how a certain team has performed against another particular team in the last four times they have met on the pitch (similar to the S defined above). But what is of importance in weighting the game results is not how well the team concerned has performed against this particular opponent in the previous 4 matches, but how the particular opponent has performed against the team concerned, that is, the strength of the opponent. If we multiply y or z with S, then we achieve the opposite of what we want: if team 1 has performed especially badly against team 2 in the past, then S is 0. If team 1 then wins the next meet-up, y is 1. 1 multiplied with 0 is still 0, and thence team 1 reaps no benefit from beating a very strong opponent. Yet, if we multiply y or z with $(1 - S)$, exactly the desired result is attained: a win of team 1 with a strength of $S = 0$ against team 2 is rewarded with $1 \cdot (1 - 0) = 1$, the maximum payoff.

7 The reason for writing S and not just [symbol not preserved in the source] is that this domain is valid for all the other forms of S in the equations below.
8 These numbers are arbitrarily chosen for their simplicity, and any other positive values will do as well, as long as the relative distance between victory, tie and loss remains the same. If changing the values of x, y, z, one has to adjust the constants K and C or change the interpretation of measures for the outcome result (R).

The index in $S_i$ and in $y_i$ signifies the time level of the last games. As i increases, one goes back in time by one game: if i is 1 then this means the last game. If i is 2 then this means the game before the last game, and so on. The same applies for the index j, except that the j indicates that another team than the one with i is concerned. $S_1$, for example, is therefore simply the strength of a team against an opponent it has played a game ago. $y_1$ would then be the corresponding game result of one game ago.

Now we have the sum of the game results of the last 15 matches weighted according to the opponents' strengths of the last 15 games. The sum implies that all the games contribute equally much, something that does not reflect reality very well. Football is short lived: most teams can barely remember the results of 8 matches ago, unless they are on a winning or losing streak, for which reason the equation takes into account the last 15 games and not much less. The most important result is always the last result and maybe the one before that. There is reason to believe that the further one goes back in time, the more unimportant the game gets by an exponential degree (as portrayed in Graph 1.0). In order to mirror this mentality, S and y or z are multiplied with a constant d with a restricted domain of:

$0 < d \le 1$

This constant then has i or j as its respective exponent. Thus the further one goes back in time the smaller $d^i$ gets, as d is a number between 1 and 0. The actual value of d can only be determined empirically, something that needs careful evaluation of innumerable data and that can't be done in this paper.

Graph 1.0: Assuming y or z to be 1, and S to be 0 (thus all the factors preceding $d^i$ are 1), the graphs for different values of d have been plotted here for an increase in i, which is plotted on the x-axis. The output value is displayed on the y-axis.

The last obstacle to be surmounted concerning a and b is that they so far consist of a sum adding up all the match results weighted according to strength and according to time elapsed. This sum now needs to be averaged lest it become very big. In order to do that, a and b are divided by the number of games played, which is i or j respectively.

Now we can turn to K and C. K and C are scaling constants, which ensure that the result does not go below 0 or higher than 1, as we have assigned no values to numbers going beyond those margins.

Interpreting the results

The dependent variable, which is the result of the equation (R), can take any value from 0 to 1; the

question now is: How should one interpret these values? Looking at the values x, y, z, a win means 1, a loss 0, and a draw 0.5. The same applies for R. The quandary is that nearly all the actual outcomes will be somewhere in between these three numbers. These numbers could easily be rounded off to the respective result; to do so, however, could mean a crude generalization. Obtaining a result of 0.76 means that, according to the equation, a win for the team of interest is more likely than a draw; nevertheless one could be forgiven for having certain reservations about whether this win should be confidently expected, as 0.76 is still quite close to 0.5. Extreme results such as 0.95 or even 1, 0.05 or 0 are extremely rare; when getting results such as 0.9 or 0.1 respectively, a confident win or loss should be expected.

Calculation of examples

In order to base the equation on less shaky grounds, empirical results are needed. To be able to calculate a formula that needs so many input variables, a friend of mine has created a program in the computer language C++, which reads out the result if the respective variables are fed to it. Because of the vast input of data the formula needs for i, j = 1, ..., 15, it was decided that i and j should only go up to 7 games. In order to be able to check the actual result, a game was chosen that had taken place in the recent past: Bavaria Munich vs Hoffenheim.
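The form factors a and b described in the Equation section can be sketched in a few lines of Python. This is not the C++ program mentioned above, which is not reproduced in the paper, but a reading of the verbal description: each result is weighted by the opponent's strength (1 - S) and by d raised to the number of games elapsed, and the sum is averaged. The function name `form_factor` is hypothetical, and dividing by the number of games n is an assumption about what "divided by the number of games played" means.

```python
def form_factor(results, strengths, d):
    """Time-discounted, opponent-weighted average of past results.

    results:   game results, most recent first (1 win, 0.5 tie, 0 loss)
    strengths: S for each of those games, i.e. the team's own record
               against that game's opponent (between 0 and 1)
    d:         decay constant with 0 < d <= 1
    """
    if len(results) != len(strengths) or not results:
        raise ValueError("results and strengths must be non-empty and equally long")
    if not 0 < d <= 1:
        raise ValueError("d must satisfy 0 < d <= 1")
    total = 0.0
    for i, (result, s) in enumerate(zip(results, strengths), start=1):
        # (1 - s) stands for the opponent's strength; d ** i shrinks as i goes back in time.
        total += result * (1 - s) * d ** i
    # Assumption: the averaging step divides by the number of games n.
    return total / len(results)


# Seven straight wins against opponents the team had always lost to (S = 0).
print(form_factor([1] * 7, [0] * 7, d=0.97))
# Seven straight wins against opponents the team had always beaten (S = 1).
print(form_factor([1] * 7, [1] * 7, d=0.97))
```

The second call shows the effect discussed under the limitations: wins against opponents a team has always beaten contribute nothing to its form.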

In this example the team of interest is Bavaria Munich, whereas Hoffenheim is the upcoming opponent. The values of the variables E, L, I and of the constants K, C and d were chosen as follows:

- d had a value of 0.97, as it was deemed necessary that the maximum impact of form should at least be about 80% ($0.97^7 \approx 0.81$).
- $E_1$ had a value of 0.9, because Bavaria Munich had had a tight schedule during that week and had played a game shortly before meeting Hoffenheim. $E_2$ was 1, as Hoffenheim simply had had a normal work-week.
- $L_1$ was 0.85, as Bavaria Munich was playing in Hoffenheim, whereas $L_2$ was of course 1.
- $I_1$ was 0.98, since Hoffenheim can be underestimated. The same cannot be said for Bavaria Munich: $I_2$ was 1.
- The constant C had a value of [value not preserved in the source], K was [value not preserved in the source].9

The statistics from the previous games have been drawn from a German sports magazine database (Kicker, n.d.). The result the program calculated was [value not preserved in the source]: the equation predicted a draw,

though Bavaria Munich was, despite all complications, slightly favored. The actual game had ended in a draw as well, confirming the result of the formula.10

Limitations and outlook

The biggest problem of the formula arises when it comes to measuring the strength of a certain team. If a team has won every single match that is included in the formula, and the opponent has lost every match, then the result is 0.7, even though the team should be expected to win the next match as well. This is because the strength of a team is only assessed by looking at the last four games, and because victories against opponents one has always won against don't have any impact on the team ($1 \cdot (1 - 1) = 0$). This is more of a theoretical problem, though, than a practical problem, for such cases generally are exceptions.11 Nevertheless, the problem persists on rare occasions: predictions of results for teams like Barcelona, who enjoy almost total domination of the Spanish league, or Shakhtar Donetsk in the Ukraine, will likely be around 0.7 or a little higher. At a certain point of domination, predictions will start converging to 0.7. This paradox, however, cannot be solved in this paper, but should be noted here.12

Further problems are variables that this equation does not account for: summer and winter breaks do not only give the team the chance to recuperate and get their E back to 1, but can give trainer and squad time to completely change mentality, fitness, discipline, tactics and so on. Squad and especially trainer changes can make a huge difference and cannot be included in the formula, because a whole rating system of trainers and players would have to be developed. One game does not help a manager very much, and this is surely the biggest objection one could have to this equation. However, there is nothing suggesting that the equation cannot be expanded with slight changes to estimate the probable results for the games of a whole season, or of other team sports for that matter. But this task will be left up to another modeler.

9 It should be mentioned that these constants were only approximated; the true constants have many more decimals, which also explains the figure of 0.7 instead of [value not preserved in the source] in "Limitations and outlook".
10 Other values have been tried out in order to check how the equation reacts to more extreme values. Had Hoffenheim e.g. won all the last games (z always equal to 1), and Bavaria Munich lost all the last games including all the 4 previous games against Hoffenheim (y = 0 at all times, $S = 0$), then for the same values of E, I, L, d, K and C the result would have been an overwhelming loss ([value not preserved in the source]) for Bavaria Munich. The exact other way around, the result would have been an overwhelming win for Bavaria Munich ([value not preserved in the source]).
11 In the calculated example, a team as strong as Bavaria Munich in the German league had, in the past 7 games, only a single time won all 4 previous matches against an opponent (and that was a particularly weak one, for whom such a measure of strength could arguably be justified).
12 One solution is to simply evaluate the predicted results differently: e.g. everything above 0.75 would be a sure win and everything below 0.25 a loss. Nonetheless, this would distort the spectrum. It is then recommended to adjust the spectrum in such a way that 0.75 is a 1 (which is a win) and 0.25 a 0 (which is a loss). The 0.5 would simply remain what it is, as it is in the middle of the distorted spectrum of 0.75 and 0.25 anyway. Any result would thus be evaluated in this manner: $\frac{R - 0.25}{0.75 - 0.25}$, and any result above 0.75 or below 0.25 would simply be set to be a clean 1 or 0. An R of 0.7 for a dominant team would therefore be 0.9, thence, according to the above section "Interpreting the results", a quite sure win. Whether these predictions then overall fit the actual results would have to be empirically tested. Another solution is to simply take into account more than just four games.
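The re-evaluation of results proposed in footnote 12 amounts to a linear stretch of the interval from 0.25 to 0.75 onto the interval from 0 to 1, with everything outside clipped. A minimal Python sketch, in which the function name `rescale_result` and its parameter names are hypothetical:

```python
def rescale_result(r, low=0.25, high=0.75):
    """Map [low, high] linearly onto [0, 1] and clip everything outside."""
    if not 0 <= r <= 1:
        raise ValueError("r must lie between 0 and 1")
    adjusted = (r - low) / (high - low)
    return min(1.0, max(0.0, adjusted))


print(round(rescale_result(0.7), 2))  # 0.9
print(rescale_result(0.5))            # 0.5
print(rescale_result(0.8))            # 1.0
```

A result of 0.7, the value towards which predictions for dominant teams converge, lands at 0.9 after rescaling, while 0.5 stays a draw.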

References:

Edgar (March 10, 2010). FIFA Ranking: Confederation weighting factor calculation. [online blog]. Retrieved from: http://www.football-rankings.info/2010/03/fifa-ranking-confederationweighting.html

FIFA (n.d.). FIFA/Coca-Cola World Ranking Schedule. Retrieved from: http://www.fifa.com/worldfootball/ranking/procedure/men.html

Gage, Jack (Aug. 2009). Most Valuable Soccer Teams. Forbes. Retrieved from: http://www.forbes.com/2009/04/08/most-valuable-soccer-teams-business-sportsmoney-soccervalues-09-intro.html

International Monetary Fund (April, 2010). Retrieved from: http://www.imf.org/external/pubs/ft/weo/2010/01/weodata/weorept.aspx?sy=2007&ey=2010&scsm=1&ssd=1&sort=country&ds=.&br=1&c=862&s=NGDPD%2CNGDPDPC%2CPPPGDP%2CPPPPC%2CLP&grp=0&a=&pr.x=76&pr.y=13

Kicker (n.d.). [online magazine and database of the Bundesliga results]. Retrieved from: http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2011-12/8/0/spieltag.html

Reuters (June 22, 2011). Retrieved from: http://www.reuters.com/article/2011/06/22/germanygambling-idUSLDE75L1QF20110622

Soccerwidow (March 31, 2011). Practical guidance to recognizing VALUE Bets. [online blog]. Retrieved from: http://www.soccerwidow.com/en/2011/03/practical-guidance-to-recognizingvalue-bets-step-by-step-tutorial/

Appendix

Rebuttal

Some of the problems of the original draft were closely linked to the fact that I was still working on my formula and it kept changing as I discovered mistakes, errors in reasoning, wrong domains of variables or possible simplifications. Thus some of the feedback was actually concerned with obsolete parts of the equation. This goes in particular for the interpretation of the formula. It was originally designed to predict the probability of the team of interest to win the next game. The equation described above, however, calculates the most probable result (win, loss, draw) in the next game, according to existing data about the team of interest and the upcoming opponent. The interpretation error of my formula had already been corrected before the feedback arrived.

Most of the feedback dealt with how to actually build up the paper, which section should treat what content, and how and to what extent variables and the values they can take should be explained. Some issues concerning the understandability of the formula have already been addressed by its radical simplification and improvement (compared to some earlier drafts). I decided to explain possible variables and how they accord with given literature and known statistics in the introduction. Then the broad formula was given right at the beginning of the main section. This was done because the actual formula so far consisted only of some variables already explained and was basically an ordinary mathematical product with some additional constants. The more complicated parts have then been displayed further below, and each variable has been treated and explained separately, as recommended in the feedback. For some variables it would have gone beyond the scope of the paper to explicate exactly why certain values were chosen for them; thus this has either been left out or added in the footnotes (e.g. footnote 7).

It seems to have been unclear at times whether the variables are discrete or continuous. I thought it to be obvious that the variables as well as the constants can take on any possible value (hence: continuous) except the ones explicitly excluded in the restricted domains, which is why this issue has not been incorporated into the paper.

It had been recommended to give the actual values for the constants K and C instead of having them denoted by their respective letter symbols. However, these constants must be adjusted relative to the values that one considers especially d to take. Of course one could assume d to take its maximum value, which is 1, and then choose such values for the constants that they cannot let the equation surpass an overall value of 1 or fall below 0. Yet, for pragmatic purposes and for more exact values of the formula, it doesn't make much sense to choose the constants for values one does not intend d to take anyway. The whole reason to have an exponential decrease in the weight of the games is exactly that one does not intend d to take the value 1. Therefore, one should choose a value for d or determine it empirically, calculate the maximum possible value and then give K and C values which ensure that the equation result is within the scope of 1 and 0.

Concern arose over how to measure underestimation. I would generally have liked to give more exact accounts of how E, I, L are measured and what values they are likely to take in reality, but the scope of the paper forbids any clear depiction of this or of any conjectures concerning sub-formulas underlying the present equation. As far as the paper allowed, possible methods for the determination of the above variables have been adumbrated.

The biggest quandary of the equation according to the feedback seems to have been the fact that, up to the feedback, there hadn't been any empirical validation of the equation. This certainly is the central weak spot of the whole model. There is one exemplary empirical calculation included in the model, but because of time and space issues more could not be included in the paper; thus, statistically speaking, the sample size for the confirmation of the formula (and its variables) is not nearly big enough. If needed I can provide the program with which the calculations have been done, but it should be noted that one needs at least two extra programs in order for the calculation program to work (the ones I used were Borland, a typical C++ compiler program, and Notepad++ in order to change the variables in the source code). The program has no graphical interface and therefore it might be slightly alien to somebody not familiar with programming languages, which is also one of the reasons why I haven't included the program in the appendix; the other is that the program is not my own work.

This should answer the remaining questions and illustrate if and how the feedback has been processed and implemented.
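The first step of the procedure described in the rebuttal (pick a value for d, then calculate the maximum possible value before fixing K and C) can be illustrated for the form factors. The sketch below is a hedged reading, not the author's calculation: it assumes the form factor is the average of $d^i$-discounted results, so that its maximum is reached when every game was won (result 1) against opponents with S = 0. The function name `max_form` is hypothetical, and how K and C then enter the main equation is not recoverable from the text.

```python
def max_form(d, n):
    """Largest value a or b can reach under the stated assumption:
    every game won against opponents with S = 0, discounted by d ** i
    and averaged over n games."""
    if not 0 < d <= 1 or n < 1:
        raise ValueError("need 0 < d <= 1 and n >= 1")
    return sum(d ** i for i in range(1, n + 1)) / n


# d = 0.97 and seven games, as in the worked example of the paper.
print(round(max_form(0.97, 7), 3))  # 0.887
# With d = 1 there is no time discount and the bound is exactly 1.
print(max_form(1, 7))               # 1.0
```

Under this reading, a smaller d lowers the attainable maximum, which is why the scaling constants would have to be chosen after d rather than before.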
