]> Statistical Physics

Quantum Distribution

There's a treatment of statistical physics, phrased in terms of a quantum model of the system, which bears describing quite generally without reference to the large number of particles or indeed temperature normally taken for granted in thermodynamics. This page began as an attempt to express quantum thermodynamics starting with that treatment, but got too big; it's now just the theory for that treatment, with the application to large numbers of particles shunted out to one other page and the treatment of the case where energy is an observable as another.

Preliminaries

In quantum mechanics we model any system by a state vector, ν, in a Hilbert space, S, on which we have a non-singular Hermitean metric (dual(S): s |S) with s(ν, ν) = 1. [Hermitean means s is antilinear – for complex scalar k and any v in S, s(k.v) = *k.s(v), where *k is the complex conjugate of k – and as symmetric as that allows it to be – for u, v in S, s(u,v) is equal to the complex conjugate of s(v,u).] The quantities we can know (observables) are expressed as linear maps (S:|S) whose composites before s are Hermitean; the ones we do know (i.e., not only are they observable, but we have observed them) necessarily commute with one another. This implies that they are diagonal with respect to some basis which unit-diagonalises s. Call that basis (S: b |dim) and its dual (dual(S): q |dim), with dim the dimension of our Hilbert space; then

in which ({scalars}: * |{scalars}) is complex conjugation; and each q(i) is in dual(S) = {linear ({scalars}: |S)}, so *&on;q(i) is antilinear ({scalars}: |S).

Thus we have a family of observables, which we can model as a set Observed of linear maps (S: |S); and we're given the values of s(ν, X(ν)) for each X in Observed; we thus have a mapping ({scalars}: Val |Observed) defined by Val(X) = s(ν, X(ν)). Now, each X in Observed has s&on;X Hermitean, whence s(X(ν),ν) = s(ν,X(ν)) is necessarily real (i.e. equal to its own conjugate), so actually we have ({reals}: Val |Observed).

One observable always available to us is the identity, which commutes with all other observables. Any observation of the identity renders an answer of 1, which expresses the above-mentioned truth that s(ν, ν) = 1, so we include the identity among Observed as a way of encoding this constraint. Since S, as a collection, is synonymous with its identity, we thus have S in Observed. For convenience, take J = {X in Observed: X is not S}, so that Observed = J&unite;{S} and S is not in J. Thus J is the collection of real physical observeds.

Note that J will typically be a collection of observables which contain much information about our system: if we identified an observable (compatible with the others) which was reasonably closely correlated with some pattern in that information, we'd be observing it. Consequently, there's a sense in which we should expect to find little information content in the degrees of freedom remaining to the solution, given the data actually observed. This is an anthropic contribution to the situation.

We have as many degrees of freedom as we have members of dim (which is typically infinite): we have as many constraints as there are members of Observed (which is typically finite). There are typically plenty more degrees of freedom than constraints.

Spreading it around

When we have more degrees of freedom than constraints, we can expect to find many solutions; our system is capable of being in any of some very large collection of actual states. When we have many solutions (especially when we have so many that Avagadro's number doesn't seem big), the study of the situation is called Statistical Physics.

We then have to chose among the solutions. In our quantum world, we can form superpositions of solutions in suitable ways to produce other solutions: indeed, between solutions we expect there to be scattering channels by which a solution at one moment evolves, over time, to form a superposition of solutions, in which the original is, possibly, a contributor.

The appropriate tool for doing superpositions of solutions is a distribution (or measure) on the solution-set, E, of our equations: formally, this is a linear map ({scalars}: :{mappings ({scalars}: :E)}), which may be though of as integrating over E. We can always extend a distribution, m, to also serve as a linear map (V: :{mappings (V::E)}) for any linear space V, by taking components in V, applying m to the scalar functions from E that result, then using the resulting scalars as components in V to reconstitute an answer. For a given distribution, m, on E and any mapping ({scalars}: f :E), we can form a new distribution on E, which I'll write as f@m, defined by:

On a discrete set, there is a standard distribution, sum, which simply adds together the outputs of the function: and any function f from the discrete set to scalars thus yields a distribution f@sum. A superposition of solutions has the form m(S: e←e |E), the integral with respect to some distribution m of E's natural embedding in S. The available scattering processes will imply some kind of redistribution over E, collectively implying a flow dynamics for m whose character we know poorly save that it is jam packed full of randomness. The basic idea of statistical physics is to reason that the actual superposition of solutions we see will be as random as it can get while yielding a solution (i.e. a member of E) as its integral of (S: e ←e :E).

Now, two distributions on E may produce the same member of S as their superposition: and the form our constraints take becomes complex and unhelpful. However, each member of E must be some superposition of basis vectors (because b spans S), so any superposition of solutions to the equations is equally a superposition of b: and two distributions superposing b to the same member of S must be equal (because b is linearly independent), so the ambiguity in among distributions on E presents no problem. Because b diagonalises our constraints, it will render them into a simpler form than E could offer.

We thus deal with (S:b|dim) rather than (S: e ←e|E), so the corresponding tool for superpositions becomes a distribution on (|b:), which (since b is monic) amounts to a distribution on dim. We already have a standard distribution on dim, namely sum (if dim is discrete; if continuous, we need to use integral in place of sum; but this makes no difference to the form of the discussion). As noted above, any mapping ({scalars}: f :dim) yields f@sum as another distribution on dim. [When dim is a continuum, some distributions on it might not be of form f@integral; but we suppose that distributions we actually want to use will be of this form.] We can thus identify any distribution with the ({scalars}: f :dim) for which f@sum is the given distribution; indeed, doing so is exactly the process of taking coordinates in S, using b as basis, of the superposition that results from the distribution. So, now, let's describe an arbitrary solution in terms of its co-ordinates, and see what form our constraints take.

Co-ordinates and randomness

We can use our bases, b and q, of S and dual(S) to express ν in co-ordinates: define ρ = ({complex}: q(i)·ν ←i |dim) and we obtain ν = sum(S: ρ(i).b(i) ←i |dim). Since q unit-diagonalised s, we have s(b(i)) = q(i), so we can infer s(ν) = sum(: *(ρ(i)).q(i) ←i |dim). Since b and q also real-diagonalise each X in Observed, we can define

whence

Val(X)
= s(ν, X(ν)) = s(ν)·X(ν)
= sum(: *(ρ(i)).q(i)·X(ν) ←i |dim)
= sum(: *(ρ(i)).Ob(X,i).ρ(i) ←i |dim)

so define

r = (: ρ(i).*(ρ(i)) ←i |dim)
noting that its outputs are non-negative reals, and obtain
Val(X) = sum(: r.Ob(X) :)

From S in Observed, as the identity, we get Ob(S) = (: 1←i |dim) whence 1 = s(ν, ν) = sum(r), and we know r's outputs are non-negative, so each output of r is at most 1 (at least when dim is discrete). This makes r@sum a (candidate to be used as a) probability distribution: equally one may think of r as indicating the proportions in which we are mixing the base states represented by b when we build up ν (but this should not be treated too literally, since r lacks the phase information of ρ).

For each i in dim we have b(i)×q(i) in {linear (S:|S)} and composing it before s yields q(i)×(*&on;q(i)) which is Hermitean, so b(i)×q(i) is an observable (in the technical sense, not necessarily a physical one); it also commutes with each of X in Observed, since these are diagonal. Its expected value in state ν is simply r(i), and quantum mechanics regards it as the probability of observing a state (previously) in state ν to be (when observed) in state b(i). In principle, at least, r is consequently observable.

Now, r is defined in terms of b, our basis, but none of the definition of either cares about any relationships among the labels (members of dim): so we should likewise consider all members of dim independently. Our constraints would complicate that, but the optimisation technique I'll be using allows us to ignore the complication until we've done the optimisation.

We need a quantifiable notion of how random r is: information theory comes to the rescue, with sum(: r.log(1/r) :), that is sum(: r(i).log(1/r(i)) ←i :). This is the information-theoretic entropy of the distribution described by r. The probability of getting distribution r at random depends on r only via its entropy, and increases with it. [Aside: log(x)/x tends to zero as x tends to infinity (because log grows slower than k.x for any positive k), so t.log(1/t) does likewise as t tends to zero (from above), so treat 0.log(1/0) as 0 if r takes the value 0 anywhere.] So we can use f(r) = sum(r.log(1/r)) as the function we aim to maximise, within our constraints.

Our constraints are (: sum(r.Ob(X)) ←X |Observed) = Val, which has the form g(r) = given required to use a standard optimisation technique. [K(i) is the value Val would have if ν were b(i).] Introduce K = transpose(Ob) = (: (: Ob(X, i) ←X |Observed) ←i |dim), so K(i,X) = Ob(X,i), and obtain

g(r)
= (: s(ν, X(ν)) ←X |Observed)
= (: sum(: r(i).K(i,X) ←i |dim) ←X |Observed)
= sum(: r(i).K(i) ←i |dim)
= sum(r.K) in {mappings ({reals}: |Observed)}

We'll be looking for some r for which f'(r) and h&on;g'(r) are parallel for some linear ({reals}: h :{mappings ({reals}: :Observed)}), which is a distribution on Observed. Given that Observed is finite (so that we can sum over it, for instance), we have a measure called sum on it already, and we can express any measure on Observed as a scalar function by which to weight its members when summing over them. This leads us to some ({reals}: H |Observed) equivalent to h as follows:

Solution

Any sufficiently small change, d, in r produces a change in g of near enough g'(r)·d, which must be zero if g(r) and g(r+d) both have the correct value. At a stationary point (e.g. any maximal or minimal point), no allowed displacement changes the value of f, which requires that every d with g'(r)·d zero also has f'(r)·d likewise zero. This implies that f'(r) can be factorised as −h&on;g'(r) for some h, as above (I could equally have used f'(r)=h&on;g'(r), but we have a choice here and f'(r)+h&on;g'(r)=0 gives H positive when we work out the details).

Now, (: t.log(1/t) = −t.log(t) ←t :)' = (: −1 + log(1/t) ←t :) so differentiation with respect to r(i), for any given i in dim, turns f(r) = sum(r.log(1/r)) into −1+log(1/r(i)) and g(r) = sum(r.K) into K(i) = (: Ob(X,i) ←X :). We have h&on;g'(r) = h&on;K = (: h(K(i)) = sum(: H(X).Ob(X,i) ←X :) ←i :) = sum(H.Ob), so we obtain 1+log(r(i)) = sum(: H(X).Ob(X,i) ←X :) or

Now, S is the identity, and each s(b(i), b(i)) is 1, so Ob(S, i) is always 1 and we have log(r(i)) = H(S)−1 + (sum(:H.Ob:J))(i), so the constraint given as S becomes 1= sum(r) = exp(H(S)−1).sum(: (exp&on;sum)(:H.Ob:J) |dim), so exp(1−H(S)) = sum(: exp&on;sum(:H.Ob:J) |dim) so we are able to express one of our constraints as a constraint on H(S), in terms of the remaining (:H:J) independent Observeds.

Pause to consider whether we got a maximum: differentiating f with respect to first r(i) then r(n), for n, i in dim, we get r(n)'s derivative of −(1+log(r(i))), which is zero unless i=n, when it is simply −1/r(i), which is negative; g's derivative is independent of r, so its second derivative (hence that of h&on;g) is zero. Thus f+h&on;g has negative definite second derivative with respect to r, implying that our stationary point was a maximum (when you move far enough away from r to get a change in f+h&on;g, it'll be negative no matter what direction you took), which is what we wanted.

Further analysis

Introduce

Our solution now becomes r(i) = zed(H,i) / exp(Y(h)), with both zed and Y ignoring H(S) so, in effect, H is only defined ({scalars}:|J) from here onwards; and h ignores the S = 1 component of Val. Pause, now, to consider the derivative of zed and, thereby, of Y.

For any X in J, differentiation with respect to H(X) turns log&on;zed(H) = (: sum(:H.K(i):J) ←i |dim) into (: K(i,X) ←i |dim) = Ob(X), hence zed(H) into Ob(X).exp&on;sum(:H.K:J). Consequently, exp(Y(h)) = sum(: zed(H) |dim) has derivative, with respect to H(X), sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim). Now, from g(r) = Val,

Val(X)
= g(r,X)
= sum(:
r(i).Ob(X,i)
= Ob(X,i).exp(sum(:H.K(i):J)) / exp(Y(H))
←i |dim)
= sum(: Ob(X).exp&on;sum(:H.Ob:J) |dim) / exp(Y(h))

This just divides our derivative of exp&on;Y by exp&on;Y, which is thus the simple derivative of Y with respect to H(X), so (:Val:J) is Y's derivative. Thus

We have exp(Y(h)) = sum(: zed(H) |dim) and, for each X in J ⊂ Observed, X = sum(: Ob(X, i).b(i)×q(i) ←i |dim) so

sum(:H.J:)
= sum(: H(X).X ←X :J)
= sum(: H(X).sum(: Ob(X,i).b(i)×q(i) ←i |dim) ←X :J)
= sum(: sum(: H(X).Ob(X,i) ←X :J).b(i)×q(i) ←i |dim)
= sum(: log(zed(H,i)).b(i)×q(i) ←i |dim),

which is diagonal with respect to our basis, hence formally an observable. In particular, we can define exp() on {linear (S:|S)}, and it maps one diagonal mapping to another, mapping each diagonal entry to its own exp() as a scalar: so exp(sum(:H.J:)) is just sum(: zed(H,i).b(i)×q(i) ←i |dim) = sum(: zed(H).b×q :dim) and its trace is exp(Y(h)). Let Z = exp(sum(:H.J:)) = exp(h(J)) and observe that Z is defined independently of our choice of basis, b (albeit we compute it using that basis) and is an observable compatible with all those in Observed. We have Y = log&on;trace&on;Z so, again, Y doesn't depend on our choice of basis (whereas zed manifestly does – it's the co-ordinates of Z wrt b, q).

Final entropy

It is interesting, now, to look at the final value of the function whose maximum we've found. That's

f(r)
= sum(r.log(1/r))
= sum(: r(i).(log(1/zed(H,i)) + Y(h)) ←i |dim)
= Y(h).sum(r) −sum(: r(i).log(zed(H,i)) ←i |dim)
= Y(h) −sum(: r(i).sum(: H(X).Ob(X, i) ←X :J) ←i |dim)
= Y(h) −sum(: H(X).sum(: r(i).Ob(X, i) ←i |dim) ←X :J)
= Y(h) −sum(: H(X).Val(X) ←X :J)
= Y(h) −h(Val)

which is manifestly independent (because Y and Z are) of our choice of simultaneous eigenbasis for our observables, even though f was defined in terms of that basis: the final value depends only on the observables and their values. Furthermore, h(Val) = h·Val = h·Y'(h), so Y(h) −h(Val) is just the estimate we would make, by linear extrapolation from Y's value and derivative at h, of the value of Y(zero).

Lack of large numbers

Thus far the only time I've needed many of anything is hidden inside how I assess the randomness of the function r. The derivation of the measure of randomness, as sum(r.log(1/r)), involved making up r of a large number of dollops. Its validity may involve more measure theory than I know, but the dollops of which it needs large numbers are purely a fiction of the construction: they make no pretence at having anything at all to do with anything physical. Any distribution which arises as a result given one number of dollops also arises for at least some larger number of dollops (e.g. any multiple of the original number), indeed the distributions made of many dollops are taken to be dense among the possible distributions. Since, for these distributions at least, we know the probability will depend on the value of r only via sum(r.log(1/r)), we infer that the same will apply to any probability distribution we manage to infer on the space of (: distributions :{mappings ({scalars}:|dim)}) will depend on position, r, in this space, only via the same sum: which may be evaluated without reference to how many dollops were involved. So the only large number needed thus far was used in an abstract context to justify using a measure of randomness which can be evaluated without reference to the large number in question.

Thus the conclusions derived here apply to any quantum system, even if it consists only of one molecule or, indeed, a single atom. In so far as its internal dynamics permit it various states with its observed properties, we can expect it to be in an entropy-maximising superposition, of these states, of the kind described here.

Insensitivity to redundant observables

Suppose we add a new observable, M, to J, but that it's entirely determined by the other observables, i.e. its expressible as a function of the observables already in J. Intuitively, this shouldn't make any difference; so let's see what the theory predicts.

We have Val(X) = ∂Y/∂H(X) = ∂(log(sum(: exp(sum(H.K(i))) ←i :dim)))/∂H(X) and we're adding an extra term to the sum(H.K(i)), H(M).K(i,M). Our partial differentiation formally treats H(M) as being independent of the other H(X), with X in J, so this won't change Val(X) for X in J; but shall imply a value for Val(M) which we should be able to infer from the other Val(X). Each K(i) gets a new component, K(i, M), which can be expressed as M's function of the other K(i, X) with X in J; we get H(M) as multiplier for this new component.

Summary

What's shown above is that:

Notes:


Valid CSSValid HTML 4.01 Written by Eddy.