Best way to format data for supervised machine learning ranking predictions

https://datascience.stackexchange.com/questions/5367

16-10-2019

Question

I'm fairly new to machine learning, but I'm doing my best to learn as much as possible.

I am curious about predicting athlete performance (runners in particular) in a race with a specific starting lineup. For instance, if RunnerA, RunnerB, RunnerC, and RunnerD are all racing a 400 meter race, I want to best predict whether RunnerA will beat RunnerB based on past race result information (which I have at my disposal). However, I have many cases where RunnerA has never raced against RunnerB, yet I do have data showing RunnerA has beaten RunnerC in the past, and RunnerC has beaten RunnerB in the past. This logic extends deeper as well. So, it would seem that RunnerA should beat RunnerB, given this information. My real concern is when it gets more complicated than this as I add more features (multiple runners, different distances, etc.), and so I'm turning to ML algorithms to help my predictions.

However, I am having difficulty figuring out how to encode this in row data that I can train on (after all, correctly formatting data is 99% of proper machine learning), and I am hoping that someone here might have thought along the same lines in the past and might be able to shed some light.

Example:

I am currently trying to include RunnerX-RunnerY past race data by counting all the races that RunnerX and RunnerY have run together and normalizing the result on a scale from -1 to 1: -1 indicates that RunnerX lost all past races against RunnerY, +1 indicates that RunnerX won all past races against RunnerY, and 0 indicates an equal number of wins and losses (or no past races against each other).
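To make the scoring concrete, here is a minimal sketch of that normalization. It assumes past results are available as finishing orders (winner first) and takes the score to be (wins - losses) / races together. The function names and the sample races are made up for illustration.

```python
from collections import defaultdict


def head_to_head_scores(races):
    """Compute pairwise scores in [-1, 1] from past races.

    races: list of finishing orders, each a list of runner names, winner first
           (an assumed input format, not something given in the question).
    Returns a dict mapping (x, y) to (wins of x over y - losses of x to y) / races together.
    """
    wins = defaultdict(int)  # wins[(x, y)] = races in which x finished ahead of y
    for order in races:
        for i, ahead in enumerate(order):
            for behind in order[i + 1:]:
                wins[(ahead, behind)] += 1

    scores = {}
    for x, y in list(wins):
        # Fill in both directions so that score(y, x) == -score(x, y).
        for a, b in ((x, y), (y, x)):
            won = wins.get((a, b), 0)
            lost = wins.get((b, a), 0)
            scores[(a, b)] = (won - lost) / (won + lost)
    return scores


def score(scores, x, y):
    """Score of x against y; 0.0 when the two have never raced each other."""
    return scores.get((x, y), 0.0)


# Hypothetical past races, winner listed first.
past_races = [
    ["RunnerA", "RunnerB", "RunnerD"],
    ["RunnerD", "RunnerC"],
    ["RunnerB", "RunnerA"],
]
scores = head_to_head_scores(past_races)
print(score(scores, "RunnerA", "RunnerD"))  # RunnerA won their only meeting
print(score(scores, "RunnerA", "RunnerB"))  # one win each
print(score(scores, "RunnerA", "RunnerC"))  # never raced each other
```

Note that a 0 here is ambiguous: it can mean an even record or no shared races at all, which is part of the difficulty described in this question.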

For instance, if RunnerA is racing RunnerB, and RunnerA has beaten RunnerB in the past, then I want the algorithm to know that (denoted by a +1 in the RunnerB column of row RunnerA); same for vice versa. Taking it another step further, if RunnerA is racing RunnerC (but the two have never raced each other in the past), and RunnerA has beaten RunnerD in a past race, and RunnerD has beaten RunnerC in a past race, then I want the algorithm to learn that RunnerA should beat RunnerC. I say "beat" here, but I mean an "average beat" for any RunnerX-RunnerY combination when data for more than one past race is available.

I have set my data up as:

name     track   surface  distance  age    RunnerA   RunnerB   RunnerC   RunnerD
RunnerA  Home    2        400       11     0         1         0         1
RunnerC  Away    2        400       12     0         0         0         -1
RunnerD  Home    2        400       10     0         0         1         0

which shows that RunnerA has beaten RunnerB and RunnerD in the past, and that RunnerD has beaten RunnerC (equivalently, RunnerC has lost to RunnerD).

The problem:

The problem is that I don't really think this is a correct display of the information for an ML algorithm.

From what I understand, rows of ML data should be independent of each other. This data isn't: row 1 records that RunnerA has beaten RunnerD, yet the information that RunnerD has beaten RunnerC is in row 3.

Does anyone have any ideas how I might be able to incorporate this past win-percentage-for-runner-pair-combination data? I'm totally stuck here. I've read a lot about algorithms that estimate win/loss by simply totaling win statistics, but those don't say anything about the actual probability of a particular runner beating another particular runner.

Any pointers would be super helpful.

Thanks!!!

Solution

This problem looks a lot like the problem of ranking college football teams. I have never worked on this ranking problem, but I believe you can borrow some tools used there to build your model.

Here are a few references:

Colley Matrix Rankings - This was one of the computer rankings used by the BCS. It is also the only one whose methodology was shared publicly.

An example of the Colley Matrix Rankings - An easy example to follow.

The Perron-Frobenius Theorem and the Ranking of Football Teams - A well-known reference that presents some ranking methods. This paper also shows how to assess the probability of winning a game based on the rankings of the teams.
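To give a feel for the first reference, here is a rough sketch of the Colley idea applied to runners instead of teams. It is only an illustration, not the BCS implementation. It assumes each past race has been reduced to (winner, loser) pairs, and the function names and sample results are made up. The ratings r solve C r = b, where C[i][i] = 2 + (number of pairings involving i), C[i][j] = -(number of pairings between i and j), and b[i] = 1 + (wins_i - losses_i) / 2.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (pure Python)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def colley_ratings(runners, results):
    """Colley-style ratings from pairwise outcomes.

    runners: list of runner names.
    results: list of (winner, loser) pairs (an assumed input format).
    """
    n = len(runners)
    idx = {name: i for i, name in enumerate(runners)}
    C = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [1.0] * n
    for winner, loser in results:
        w, l = idx[winner], idx[loser]
        C[w][w] += 1.0   # one more pairing for the winner
        C[l][l] += 1.0   # one more pairing for the loser
        C[w][l] -= 1.0   # one more pairing between the two
        C[l][w] -= 1.0
        b[w] += 0.5      # b = 1 + (wins - losses) / 2
        b[l] -= 0.5
    return dict(zip(runners, solve(C, b)))


# Results from the question: RunnerA beat RunnerB and RunnerD, RunnerD beat RunnerC.
runners = ["RunnerA", "RunnerB", "RunnerC", "RunnerD"]
results = [("RunnerA", "RunnerB"), ("RunnerA", "RunnerD"), ("RunnerD", "RunnerC")]
ratings = colley_ratings(runners, results)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(name, round(ratings[name], 3))
```

With these results RunnerA ends up rated above RunnerC even though the two never raced, because the rating of each runner depends on the ratings of the opponents they faced. That is the transitive behaviour the question asks for. Turning a rating gap into a win probability is a separate step, which the last reference discusses.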

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange