I rambled at length elsewhere; here I'll just put the short version, splitting the post and omitting parts. Feel free to do the same.
@GnocchiPup said in #39:
> Self play to generate the NN.
Not generate, but globally optimize the weights of the existing NN (from some initial condition in weight search space), which is what you meant, I assume. That does involve an evolving pair of NN-engine instances producing the games used in the global optimization scheme called deep reinforcement learning, or RL (or deep RL).
> This provides the engine with the bias on which positions look good, and which to search first.
"This" the RL scheme itself provides such initial uniformly random bias (good term by the way, as preference or belief or prior), to start the sequence of self-play batches sequentially pairing 2 instances of the previous learning instance.
Each batch having only one instance allow to optimize its weights. (per my current understanding.. of course and as always).
So, as for humans, experience is about shaping that bias over many games. That is the 95% of our brain we could call the statistical brain, or our intuitive processes, which we are rarely aware of during performance; it takes deliberate study of it, either professionally or by quirk of personality, to get some idea about it.
So yes, the first batch starts uniform, but with the self-play steps I just introduced, each new batch initializes its bias from the last bias of the previously learned instance.
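To make that concrete, here is a minimal self-contained toy of the loop I have in mind (my assumption of its shape, not Lc0's actual pipeline): each generation's optimization starts from the previous generation's weights, so the learned bias carries over batch to batch. All names and the "training" step are illustrative stand-ins.

```python
import random

def init_network():
    # generation 0: an arbitrary starting point in weight space (uniform/random bias)
    return {"weights": [random.uniform(-0.1, 0.1) for _ in range(4)]}

def play_selfplay_game(net):
    # stand-in for a self-play game: returns a fake outcome in {-1, 0, +1}
    return random.choice([-1, 0, 1])

def train_on_games(net, outcomes):
    # stand-in for a gradient step: nudge the CURRENT weights using the batch,
    # so the next generation inherits (and refines) the previous bias
    mean = sum(outcomes) / len(outcomes)
    return {"weights": [w + 0.01 * mean for w in net["weights"]]}

def reinforcement_loop(generations=5, games_per_batch=32):
    net = init_network()
    for _ in range(generations):
        outcomes = [play_selfplay_game(net) for _ in range(games_per_batch)]
        net = train_on_games(net, outcomes)   # new batch starts from the old bias
    return net

print(reinforcement_loop())
```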
> During the actual game, self play Montecarlo search.
At first it seemed to me that you were talking about the whole process of training and playing, but it appears now that you are giving a sequential description. So I thought you were talking at this step about the play mode, not the training mode.
The difference between the modes, I think (and I may need correction or an update), is that in play mode the NN weights are fixed to the last post-training instance. But whether in self-play (RL proper, as I described) or in tournament/play-mode games, we are given an NN weight "vector" (all weights having a value, so an executable), which determines the "bias" as a function of state and action (position and candidate move upon it). Yes, something is done to sample the probability function that the NN vector determines.
And that is done by some Monte-Carlo sampling method in the context of tree search, something like the selection rule sketched below. But looming over all of this is the probability model that generates the games given the inputs and the NN data (the current bias function on the input).
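Here is a sketch of the AlphaZero-style PUCT selection rule, which I understand Lc0's search to be modeled on: the NN policy prior P biases which child gets explored, while Q and the visit counts N come from the search itself. The constant c_puct and the +1 inside the square root are my own illustrative choices, not Lc0's actual parameters.

```python
import math

def puct_select(children, c_puct=1.5):
    """children: list of dicts with keys P (prior), N (visits), W (total value)."""
    total_visits = sum(child["N"] for child in children)

    def score(child):
        q = child["W"] / child["N"] if child["N"] > 0 else 0.0
        u = c_puct * child["P"] * math.sqrt(total_visits + 1) / (1 + child["N"])
        return q + u

    return max(children, key=score)

# toy usage: three candidate moves with priors from the policy head
moves = [
    {"move": "e2e4", "P": 0.5, "N": 10, "W": 5.5},
    {"move": "d2d4", "P": 0.4, "N": 8,  "W": 4.5},
    {"move": "a2a3", "P": 0.1, "N": 0,  "W": 0.0},
]
print(puct_select(moves)["move"])
```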
The model represents the learned bias that optimizes the accumulated reward estimate, provided all the past games during training were representative of the chess world targeted by the whole scheme. Right? Anyone? So where the "replay" stuff fits in there is my mystery; my best guess is sketched after this paragraph.
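My guess at where the "replay" part fits (an assumption based on the generic AlphaZero-style recipe, not something I can point to in Lc0's training code): positions from finished self-play games go into a sliding-window buffer, and gradient steps sample minibatches from that buffer rather than only from the latest game. Names below are illustrative.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # sliding window over recent self-play positions

def store_game(positions, outcome):
    # each training example pairs a position with the search policy and the
    # game's final W/D/L endpoint, seen from that position's side to move
    for pos, search_policy in positions:
        replay_buffer.append((pos, search_policy, outcome))

def sample_minibatch(batch_size=256):
    k = min(batch_size, len(replay_buffer))
    return random.sample(list(replay_buffer), k)

# toy usage
store_game([("startpos", [0.6, 0.4])], outcome=0)
print(len(replay_buffer), sample_minibatch(8))
```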
> This is where its WDL comes from and not an a priori.
The WDL numbers are statistical estimates, per the NN model, of that bias as evolved through RL over self-play games: an evolved version of the uniform random prior policy over all positions (a function defined over the domain of all positions; no need to construct all of them to be able to say that). Well, it stems from that.
possible new understanding:
I guess I see it now. The NN does not directly spit out WDL probabilities; it spits out something more like the pieces of a Bayesian update: the evaluation and the policy (a probability vector over the possible moves) at a given position.
We can only speak of WDL for sets of games whose outcome endpoints have been tallied.
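To make the shapes concrete: as I understand it, the policy is a softmax over candidate moves, and the value-type head is either a single scalar evaluation or (I believe, in newer Lc0 nets) a 3-way softmax over win/draw/loss; either way it is a per-position output, which is a different thing from tallying finished games. A purely illustrative toy, not Lc0's architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def toy_forward(move_logits, wdl_logits):
    # in a real net these logits come out of the trunk; here they are just inputs
    policy = softmax(move_logits)       # probability per candidate move
    w, d, l = softmax(wdl_logits)       # per-position W/D/L estimate (a WDL-style head)
    expected_score = w + 0.5 * d        # one common scalar summary of the triple
    return policy, (w, d, l), expected_score

print(toy_forward([2.0, 1.0, 0.1], [0.2, 1.5, -0.3]))
```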
Fog here:
Somewhere in there, the replay buffer is required and is generated by some MC-inspired tree search method in which the decision nodes get visited, by producing complete games along which those decisions are visited?
Decision nodes are not state nodes.
They are pairs of (state AND action), i.e. (action GIVEN the state) = (move GIVEN the position).
That is a more fundamental mystery of mine: the relation between the decision tree and the game world or position world (or both). They are not the same kind of nodes. I can consider nodes that are each a full position-information instance, with actions being links between some of those, and games being full paths from the initial position node (a position, not a decision) that traverse legal links between legal positions. The position world contains the nodes, the links connect them as legal transitions (directed links), and the legal games are those paths.

Now one can put a local coordinate system on that position world, following one player or both, but let's focus on the one player having only that game to think about during that game. Yes, if it had full knowledge, it could consult a data structure extracted from all the paths I just described, which we could call a decision tree. As the player actually makes decisions, one turn at a time, it is traversing such a tree. But the ambient space I just evoked is where the probability law determined by the NN weights is defined, the law that generates the games (rigged, though, to satisfy some global reward-optimization problem, which is a function of the W, D and L probabilities over the full experience set embedded in the learned NN, something like that). I don't recall the math, but W, D and L enter the objective function (one can toy with the exact algebraic formulation; that is part of the ML art).
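Here is the distinction I am trying to make, as a data-structure sketch: in the usual AlphaZero-style bookkeeping, the decision statistics (prior P, visit count N, accumulated value W) live on the (position, move) edges, while the nodes are the full positions. This is illustrative, not Lc0 source.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Edge:                     # a decision: a move GIVEN a position
    prior: float = 0.0          # P(s, a) from the policy head (the "bias")
    visits: int = 0             # N(s, a), accumulated by the search
    total_value: float = 0.0    # W(s, a)

    @property
    def q(self) -> float:       # mean value Q(s, a)
        return self.total_value / self.visits if self.visits else 0.0

@dataclass
class Node:                     # a full position (a state in the "position world")
    edges: Dict[str, Edge] = field(default_factory=dict)   # keyed by move

# toy usage: one position with two candidate moves
root = Node(edges={"e2e4": Edge(prior=0.6), "d2d4": Edge(prior=0.4)})
root.edges["e2e4"].visits += 1
root.edges["e2e4"].total_value += 0.5
print(root.edges["e2e4"].q)
```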
I guess somewhere in there one can compute WDL frequencies from game sets, and that in turn gives some confidence-interval termination criterion to test in the MCTS-type algorithm doing the search. So what is the full event statement that W, D or L represent? W: the probability of winning at the given position, given what set of games? The NN-resampled full games, with enough visits to the decision nodes as determined by some confidence-interval test? Can anyone fix my understanding? I am struggling to verbalize here.
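To at least make the frequency picture concrete: given a set of finished games from (or passing through) a position, one can tally empirical W/D/L frequencies and attach a normal-approximation confidence interval. Whether anything like this is literally used as a termination test in the search is my assumption, not something I can point to; the function below is purely illustrative.

```python
import math
from collections import Counter

def wdl_estimate(outcomes, z=1.96):
    """outcomes: list of 'W', 'D', 'L' from the mover's point of view."""
    n = len(outcomes)
    counts = Counter(outcomes)
    result = {}
    for key in ("W", "D", "L"):
        p = counts[key] / n
        half_width = z * math.sqrt(p * (1 - p) / n)   # ~95% CI half-width
        result[key] = (p, half_width)
    return result

print(wdl_estimate(["W", "D", "D", "L", "W", "D", "W", "D"]))
```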
> In theory a SF 0.0 would produce a WDL of 0 100 0, but the same position could look very different for Lc0. Could be 10 80 10, could be 30 60 10. Both cases, best play is draw, but second case is harder for Black. I have a position in mind to test this, might do a post in the future.
This is WIP per our personal inbox communication, with a specific position...