REINFORCE-ES applied to quadruped environment
Model very quickly converges to actors that adopt static positions and stay in those positions. I think the reason this happens for this model and not the openai ones is because here the model is outputting the positions of the motors rather than there torques. This we're much more likely to see static poses as each of those is at least a stable solution to the problem, especially if there is a fall penalty. Using torque means the network governs rate of change which means it's more likely to suggest dynamic solutions rather than static ones. This essentially injects noise into the model which acts as exploration.
Couple of potential solutions:
Other things:
Appears the torque applied to the motor wasn't strong enough which meant the default outcome was for the quadruped to fall over and be unable to get back up.
So the network outputs and extremely volatile signal. I checked to see if this was the same as in the ant case and it was. In fact evo-ants network is basically a very chaotic signal that's then squashed down onto [-1, 1] by the tanh output layer into essentially a square wave with different frequencies. I think our issue is that if a network output for a specific action is distributed around a mean sufficiently far outside of [-1, 1] then we end up with a signal that's mostly just takes value 1. The parameter updates are unlikely to move this signal by much and the end result will likely be that the model can't learn anything for that signal. This is the notable difference between ant and quad network outputs, namely that the mean of the network outputs is around 0 for the ant but outside of [-1, 1] for quad.
So the robot does a pretty good impression of trying to run. However I think the weights are all completely off as are the frictional constants, hence it sort of gallops on the spot.