There's a treatment of statistical physics, phrased in terms of a quantum model of the system, which bears describing quite generally without reference to the large number of particles (or, indeed, temperature) normally taken for granted in thermodynamics. This page began as an attempt to express quantum thermodynamics starting with that treatment, but got too big; it's now just the theory for that treatment, with the application to large numbers of particles shunted out to one other page and the treatment of the case where energy is an observable to another.
In quantum mechanics we model any system by a state vector, ν, in a Hilbert space, S, on which we have a non-singular Hermitean metric (dual(S): s |S) with s(ν, ν) = 1. [Hermitean means s is antilinear – for complex scalar k and any v in S, s(k.v) = *k.s(v), where *k is the complex conjugate of k – and as symmetric as that allows it to be – for u, v in S, s(u,v) is equal to the complex conjugate of s(v,u).] The quantities we can know (observables) are expressed as linear maps (S:|S) whose composites before s are Hermitean; the ones we do know (i.e., not only are they observable, but we have observed them) necessarily commute with one another. This implies that they are diagonal with respect to some basis which unit-diagonalises s. Call that basis (S: b |dim) and its dual (dual(S): q |dim), with dim the dimension of our Hilbert space; then
s = sum(: q(i)×(*∘q(i)) ←i |dim)
in which ({scalars}: * |{scalars}) is complex conjugation; and each q(i) is in dual(S) = {linear ({scalars}: |S)}, so *∘q(i) is antilinear ({scalars}: |S).
Thus we have a family of observables, which we can model as a set Observed of linear maps (S: |S); and we're given the values of s(ν, X(ν)) for each X in Observed; we thus have a mapping ({scalars}: Val |Observed) defined by Val(X) = s(ν, X(ν)). Now, each X in Observed has s∘X Hermitean, whence s(X(ν),ν) = s(ν,X(ν)) is necessarily real (i.e. equal to its own conjugate), so actually we have ({reals}: Val |Observed).
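As a small illustration of why Val comes out real, here is a sketch in Python for a two-dimensional S. The matrix `X` and the state `nu` are made-up values, not taken from the text; the form `s` is taken to be antilinear in its first argument, matching the convention above.

```python
# Hypothetical Hermitean observable on a 2-dimensional space (illustrative values).
X = [[1.0, 2 - 1j],
     [2 + 1j, 3.0]]

# Hypothetical state with s(nu, nu) = 0.36 + 0.64 = 1.
nu = [0.6, 0.8j]

def s(u, v):
    """Hermitean form in an orthonormal basis: antilinear in u, linear in v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def apply(matrix, v):
    """Apply a linear map, given as a matrix, to a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

nu = [complex(c) for c in nu]
val = s(nu, apply(X, nu))  # Val(X) = s(nu, X(nu)); its imaginary part should vanish
```

The imaginary part cancels because the off-diagonal entries of `X` are conjugates of one another, which is what s∘X being Hermitean amounts to in these co-ordinates.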
One observable always available to us is the identity, which commutes with
all other observables. Any observation of the identity renders an answer of 1,
which expresses the above-mentioned truth that s(ν, ν) = 1, so we include
the identity among Observed as a way of encoding this constraint. Since S, as a
collection, is synonymous with its identity, we thus have S in Observed. For
convenience, take J = {X in Observed: X is not S}, so that Observed =
J∪{S} and S is not in J. Thus J is the collection of real physical
observeds.
Note that J will typically be a collection of observables which contain much information about our system: if we identified an observable (compatible with the others) which was reasonably closely correlated with some pattern in that information, we'd be observing it. Consequently, there's a sense in which we should expect to find little information content in the degrees of freedom remaining to the solution, given the data actually observed. This is an anthropic contribution to the situation.
We have as many degrees of freedom as we have members of dim (which is typically infinite): we have as many constraints as there are members of Observed (which is typically finite). There are typically plenty more degrees of freedom than constraints.
When we have more degrees of freedom than constraints, we can expect to find many solutions; our system is capable of being in any of some very large collection of actual states. When we have many solutions (especially when we have so many that Avogadro's number doesn't seem big), the study of the situation is called Statistical Physics.
We then have to choose among the solutions. In our quantum world, we can form superpositions of solutions in suitable ways to produce other solutions: indeed, between solutions we expect there to be scattering channels by which a solution at one moment evolves, over time, to form a superposition of solutions, in which the original is, possibly, a contributor.
The appropriate tool for doing superpositions of solutions is a distribution (or measure) on the solution-set, E, of our equations: formally, this is a linear map ({scalars}: :{mappings ({scalars}: :E)}), which may be thought of as integrating over E. We can always extend a distribution, m, to also serve as a linear map (V: :{mappings (V::E)}) for any linear space V, by taking components in V, applying m to the scalar functions from E that result, then using the resulting scalars as components in V to reconstitute an answer. For a given distribution, m, on E and any mapping ({scalars}: f :E), we can form a new distribution on E, which I'll write as f@m, defined by:
f@m = (: m(: f(e).g(e) ←e :E) ←g :{mappings ({scalars}: :E)})
On a discrete set, there is a standard distribution, sum, which simply adds together the outputs of the function: and any function f from the discrete set to scalars thus yields a distribution f@sum. A superposition of solutions has the form m(S: e←e |E), the integral with respect to some distribution m of E's natural embedding in S. The available scattering processes will imply some kind of redistribution over E, collectively implying a flow dynamics for m whose character we know poorly save that it is jam-packed full of randomness. The basic idea of statistical physics is to reason that the actual superposition of solutions we see will be as random as it can get while yielding a solution (i.e. a member of E) as its integral of (S: e ←e :E).
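A minimal Python sketch of distributions on a finite set may help fix ideas. It assumes that f@m means "apply m to the pointwise product of f with the integrand"; the names `total` and `weighted`, and the sample data, are hypothetical and not from the text.

```python
def total(points):
    """The standard distribution `sum` on a finite set: add up a function's outputs."""
    pts = list(points)
    return lambda g: sum(g(e) for e in pts)

def weighted(f, m):
    """f@m (as assumed here): integrate g against m after weighting it pointwise by f."""
    return lambda g: m(lambda e: f(e) * g(e))

# Illustrative finite set with made-up weights and values.
E = ["a", "b", "c"]
weights = {"a": 0.5, "b": 0.25, "c": 0.25}
values = {"a": 2.0, "b": 4.0, "c": 8.0}

f_at_sum = weighted(lambda e: weights[e], total(E))
mean = f_at_sum(lambda e: values[e])   # 0.5*2 + 0.25*4 + 0.25*8
count = total(E)(lambda e: 1)          # integrating the constant 1 counts the points
```

Treating a distribution as "something that eats a function and returns a scalar" keeps the linear-map reading of the text explicit; the weights only enter through `weighted`.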
Now, two distributions on E may produce the same member of S as their superposition: and the form our constraints take becomes complex and unhelpful. However, each member of E must be some superposition of basis vectors (because b spans S), so any superposition of solutions to the equations is equally a superposition of b: and two distributions superposing b to the same member of S must be equal (because b is linearly independent), so the ambiguity among distributions on E presents no problem. Because b diagonalises our constraints, it will render them into a simpler form than E could offer.
We thus deal with (S:b|dim) rather than (S: e ←e|E), so the corresponding tool for superpositions becomes a distribution on (|b:), which (since b is monic) amounts to a distribution on dim. We already have a standard distribution on dim, namely sum (if dim is discrete; if continuous, we need to use integral in place of sum; but this makes no difference to the form of the discussion). As noted above, any mapping ({scalars}: f :dim) yields f@sum as another distribution on dim. [When dim is a continuum, some distributions on it might not be of form f@integral; but we suppose that distributions we actually want to use will be of this form.] We can thus identify any distribution with the ({scalars}: f :dim) for which f@sum is the given distribution; indeed, doing so is exactly the process of taking coordinates in S, using b as basis, of the superposition that results from the distribution. So, now, let's describe an arbitrary solution in terms of its co-ordinates, and see what form our constraints take.
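We can use our bases, b and q, of S and dual(S) to express ν in co-ordinates: define ρ = ({complex}: q(i)·ν ←i |dim) and we obtain ν = sum(S: ρ(i).b(i) ←i |dim). Since b and q unit-diagonalise s, we have s(b(i)) = q(i), so we can infer s(ν) = sum(: *(ρ(i)).q(i) ←i |dim). Since b and q also real-diagonalise each X in Observed, we can define
Ob = (: (: q(i)·X(b(i)) ←i |dim) ←X |Observed), so that X(b(i)) = Ob(X,i).b(i),
whence
Val(X) = s(ν, X(ν)) = sum(: *(ρ(i)).ρ(i).Ob(X,i) ←i |dim),
so define
r = ({reals}: *(ρ(i)).ρ(i) ←i |dim), which gives Val(X) = sum(r.Ob(X)).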
From S in Observed, as the identity, we get Ob(S) = (: 1←i |dim) whence 1 = s(ν, ν) = sum(r), and we know r's outputs are non-negative, so each output of r is at most 1 (at least when dim is discrete). This makes r@sum a (candidate to be used as a) probability distribution: equally one may think of r as indicating the proportions in which we are mixing the base states represented by b when we build up ν (but this should not be treated too literally, since r lacks the phase information of ρ).
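To make the passage from co-ordinates to probabilities concrete, here is a sketch in Python. It assumes r(i) is the squared modulus of ρ(i) and that each constraint reads Val(X) = sum(r.Ob(X)); the co-ordinates `rho` and the diagonal entries in `Ob` are illustrative values only.

```python
import cmath

# Hypothetical co-ordinates of nu in a 3-state basis; squared moduli sum to 1.
rho = [complex(0.6, 0.0), complex(0.0, 0.48), complex(0.64, 0.0)]

# Hypothetical diagonal entries Ob(X, i): the identity S and one other observable X.
Ob = {"S": [1.0, 1.0, 1.0], "X": [0.0, 1.0, 2.0]}

def proportions(coords):
    """r(i) = conj(rho(i)) * rho(i): real, non-negative, phase-free."""
    return [(c.conjugate() * c).real for c in coords]

def values(r):
    """Val(X) = sum over i of r(i) * Ob(X, i), for each observable X."""
    return {name: sum(p * o for p, o in zip(r, diag)) for name, diag in Ob.items()}

r = proportions(rho)
Val = values(r)

# Changing the phase of each co-ordinate leaves r, hence Val, untouched.
rephased = [c * cmath.exp(1j * k) for k, c in enumerate(rho)]
r_rephased = proportions(rephased)
```

The rephased copy shows in what sense r lacks the phase information of ρ: distinct states can share the same r.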
For each i in dim we have b(i)×q(i) in {linear (S:|S)} and composing it before s yields q(i)×(*∘q(i)) which is Hermitean, so b(i)×q(i) is an observable (in the technical sense, not necessarily a physical one); it also commutes with each X in Observed, since these are diagonal. Its expected value in state ν is simply r(i), and quantum mechanics regards it as the probability that a system (previously) in state ν will be found (when observed) in state b(i). In principle, at least, r is consequently observable.
Now, r is defined in terms of b, our basis, but nothing in the definition of either cares about any relationships among the labels (members of dim): so we should likewise consider all members of dim independently. Our constraints would complicate that, but the optimisation technique I'll be using allows us to ignore the complication until we've done the optimisation.
We need a quantifiable notion of how random r is: information theory comes to the rescue, with sum(: r.log(1/r) :), that is sum(: r(i).log(1/r(i)) ←i :). This is the information-theoretic entropy of the distribution described by r. The probability of getting distribution r at random depends on r only via its entropy, and increases with it. [Aside: log(x)/x tends to zero as x tends to infinity (because log grows slower than k.x for any positive k), so t.log(1/t) does likewise as t tends to zero (from above), so treat 0.log(1/0) as 0 if r takes the value 0 anywhere.] So we can use f(r) = sum(r.log(1/r)) as the function we aim to maximise, within our constraints.
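The entropy just described is easy to compute; the following Python sketch follows the convention of treating 0.log(1/0) as 0. The three sample distributions are made up for illustration.

```python
import math

def entropy(r):
    """sum(r.log(1/r)), skipping zero entries so that 0.log(1/0) counts as 0."""
    return sum(p * math.log(1.0 / p) for p in r if p > 0)

# Illustrative distributions on a four-member dim.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
certain = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform)   # log(4), the largest value available on four states
h_peaked = entropy(peaked)
h_certain = entropy(certain)   # no randomness at all
```

With no constraint beyond sum(r) = 1, the uniform distribution is the most random; the constraints from Observed are what pull the answer away from it.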
Our constraints are (: sum(r.Ob(X)) ←X |Observed) = Val, which has the form "g(r) = given" required to use a standard optimisation technique. Introduce K = transpose(Ob) = (: (: Ob(X, i) ←X |Observed) ←i |dim), so K(i,X) = Ob(X,i), and obtain
g(r) = sum(: r(i).K(i) ←i |dim) = sum(r.K)
[K(i) is the value Val would have if ν were b(i).]
We'll be looking for some r for which f'(r) and h∘g'(r) are parallel for some linear ({reals}: h :{mappings ({reals}: :Observed)}), which is a distribution on Observed. Given that Observed is finite (so that we can sum over it, for instance), we have a measure called sum on it already, and we can express any measure on Observed as a scalar function by which to weight its members when summing over them. This leads us to some ({reals}: H |Observed) equivalent to h as follows:
h = H@sum = (: sum(: H(X).v(X) ←X |Observed) ←v :{mappings ({reals}: :Observed)})
Any sufficiently small change, d, in r produces a change in g of near enough g'(r)·d, which must be zero if g(r) and g(r+d) both have the correct value. At a stationary point (e.g. any maximal or minimal point), no allowed displacement changes the value of f, which requires that every d with g'(r)·d zero also has f'(r)·d likewise zero. This implies that f'(r) can be factorised as −h∘g'(r) for some h, as above (I could equally have used f'(r)=h∘g'(r), but we have a choice here and f'(r)+h∘g'(r)=0 gives H positive when we work out the details).
Now, (: t.log(1/t) = −t.log(t) ←t :)' = (: −1 + log(1/t) ←t :) so differentiation with respect to r(i), for any given i in dim, turns f(r) = sum(r.log(1/r)) into −1+log(1/r(i)) and g(r) = sum(r.K) into K(i) = (: Ob(X,i) ←X :). We have h∘g'(r) = h∘K = (: h(K(i)) = sum(: H(X).Ob(X,i) ←X :) ←i :) = sum(H.Ob), so we obtain 1+log(r(i)) = sum(: H(X).Ob(X,i) ←X :) or
r(i) = exp(sum(: H(X).Ob(X,i) ←X |Observed) − 1)
Now, S is the identity, and each s(b(i), b(i)) is 1, so Ob(S, i) is always 1 and we have log(r(i)) = H(S)−1 + (sum(:H.Ob:J))(i), so the constraint given as S becomes 1 = sum(r) = exp(H(S)−1).sum(: (exp∘sum)(:H.Ob:J) |dim), so exp(1−H(S)) = sum(: exp∘sum(:H.Ob:J) |dim) so we are able to express one of our constraints as a constraint on H(S), in terms of the remaining (:H:J) independent Observeds.
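The following Python sketch works this through on a four-state example: given multipliers on J it fixes H(S) from the normalisation and then builds r from log(r(i)) = H(S)−1 + sum(:H.Ob:J)(i). The observables "A" and "B", their diagonal entries, and the multipliers are all hypothetical.

```python
import math

# Hypothetical diagonal entries Ob(X, i) for two observables in J.
Ob = {"A": [0.0, 1.0, 2.0, 3.0], "B": [1.0, 0.0, 1.0, 0.0]}
# Hypothetical multipliers H(X) for X in J.
H = {"A": -1.0, "B": 0.5}

dim = range(4)

def exponent(i):
    """(sum(:H.Ob:J))(i): the weighted sum of diagonal entries at basis member i."""
    return sum(H[X] * Ob[X][i] for X in H)

weights = [math.exp(exponent(i)) for i in dim]

# Normalisation: exp(1 - H(S)) = sum of the weights, so H(S) is determined by (:H:J).
H_S = 1.0 - math.log(sum(weights))

r = [math.exp(H_S - 1.0 + exponent(i)) for i in dim]
```

Nothing here solves for the multipliers from given values of Val; it only shows how, once (:H:J) is chosen, the identity's constraint pins down H(S).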
Pause to consider whether we got a maximum: differentiating f with respect to first r(i) then r(n), for n, i in dim, we get r(n)'s derivative of −(1+log(r(i))), which is zero unless i=n, when it is simply −1/r(i), which is negative; g's derivative is independent of r, so its second derivative (hence that of h∘g) is zero. Thus f+h∘g has negative definite second derivative with respect to r, implying that our stationary point was a maximum (when you move far enough away from r to get a change in f+h∘g, it'll be negative no matter what direction you took), which is what we wanted.
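A numerical spot check of the maximum, sketched in Python under illustrative assumptions: a single hypothetical observable with diagonal entries 0, 1, 2, 3 and multiplier −1. The displacements in `allowed` are chosen by hand to leave both sum(r) and sum(r.Ob(X)) unchanged.

```python
import math

def entropy(r):
    """sum(r.log(1/r)), with 0.log(1/0) taken as 0."""
    return sum(p * math.log(1.0 / p) for p in r if p > 0)

# Hypothetical observable X in J and the exponential-form solution for H(X) = -1.
ob_x = [0.0, 1.0, 2.0, 3.0]
weights = [math.exp(-1.0 * x) for x in ob_x]
r = [w / sum(weights) for w in weights]

# Displacements d with sum(d) = 0 and sum(d.Ob(X)) = 0, so the constraints still hold.
allowed = [[1.0, -2.0, 1.0, 0.0], [0.0, 1.0, -2.0, 1.0]]

def shifted(d, t):
    """Move r a distance t along displacement d."""
    return [p + t * x for p, x in zip(r, d)]

# Entropy lost by stepping either way along each allowed displacement.
drops = [entropy(r) - entropy(shifted(d, t))
         for d in allowed for t in (0.01, -0.01)]
```

Every entry of `drops` should be positive: moving within the constraints, in either direction, only lowers the entropy. This is a check on one example, not a proof.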
Introduce
zed = (: (: exp(sum(: H(X).K(i,X) ←X |J)) ←i |dim) ←H :) and Y = (: log(sum(: zed(H) |dim)) ←h :)
Our solution now becomes r(i) = zed(H,i) / exp(Y(h)), with both zed and Y ignoring H(S) so, in effect, H is only defined ({scalars}:|J) from here onwards; and h ignores the S = 1 component of Val. Pause, now, to consider the derivative of zed and, thereby, of Y.
For any X in J, differentiation with respect to H(X) turns log∘zed(H) = (: sum(:H.K(i):J) ←i |dim) into (: K(i,X) ←i |dim) = Ob(X), hence zed(H) into Ob(X).exp∘sum(:H.K:J). Consequently, exp(Y(h)) = sum(: zed(H) |dim) has derivative, with respect to H(X), sum(: Ob(X).exp∘sum(:H.Ob:J) |dim). Now, from g(r) = Val,
Val(X) = sum(: r.Ob(X) |dim) = sum(: Ob(X).exp∘sum(:H.Ob:J) |dim) / exp(Y(h))
This just divides our derivative of exp∘Y by exp∘Y, which is thus the simple derivative of Y with respect to H(X), so (:Val:J) is Y's derivative. Thus
(: Val :J) = Y'(h)
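The claim that (:Val:J) is Y's derivative can be checked numerically. The Python sketch below compares the weighted average of each observable's diagonal entries with a central finite difference of Y; it assumes Y is the log of the sum of the exponential weights, and the observables and multipliers are hypothetical.

```python
import math

# Hypothetical diagonal entries Ob(X, i) for the observables in J.
Ob = {"A": [0.0, 1.0, 2.0, 3.0], "B": [1.0, 0.0, 1.0, 0.0]}

def zed(H):
    """Unnormalised weights: exp of the H-weighted sum of diagonal entries, per state."""
    return [math.exp(sum(H[X] * Ob[X][i] for X in Ob)) for i in range(4)]

def Y(H):
    """Log of the sum of the weights."""
    return math.log(sum(zed(H)))

def val(H, X):
    """Val(X) = sum(r.Ob(X)) with r = zed(H) / exp(Y(H))."""
    z = zed(H)
    return sum(o * w for o, w in zip(Ob[X], z)) / sum(z)

def dY(H, X, step=1e-6):
    """Central finite-difference estimate of Y's derivative with respect to H(X)."""
    up, down = dict(H), dict(H)
    up[X] += step
    down[X] -= step
    return (Y(up) - Y(down)) / (2 * step)

H = {"A": -0.7, "B": 0.3}   # hypothetical multipliers
gaps = {X: abs(val(H, X) - dY(H, X)) for X in Ob}
```

The entries of `gaps` should be at the level of finite-difference error, far smaller than the values themselves.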
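We have exp(Y(h)) = sum(: zed(H) |dim) and, for each X in J ⊂ Observed, X = sum(: Ob(X, i).b(i)×q(i) ←i |dim) so
h(J) = sum(: H.J :) = sum(: sum(: H(X).Ob(X,i) ←X |J).b(i)×q(i) ←i |dim) = sum(: log(zed(H,i)).b(i)×q(i) ←i |dim)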
which is diagonal with respect to our basis, hence formally an observable. In particular, we can define exp() on {linear (S:|S)}, and it maps one diagonal mapping to another, mapping each diagonal entry to its own exp() as a scalar: so exp(sum(:H.J:)) is just sum(: zed(H,i).b(i)×q(i) ←i |dim) = sum(: zed(H).b×q :dim) and its trace is exp(Y(h)). Let Z = exp(sum(:H.J:)) = exp(h(J)) and observe that Z is defined independently of our choice of basis, b (albeit we compute it using that basis) and is an observable compatible with all those in Observed. We have Y = log∘trace∘Z so, again, Y doesn't depend on our choice of basis (whereas zed manifestly does – it's the co-ordinates of Z wrt b, q).
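To see the basis-independence of Y = log∘trace∘Z in a small case, the Python sketch below takes a 2×2 diagonal matrix standing in for sum(:H.J:), rewrites it in a rotated orthonormal basis, and compares log(trace(exp(...))) computed both ways. The diagonal entries and rotation angle are made up, and the truncated power series in `expm` is a convenience for this sketch rather than a recommended way to exponentiate matrices.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(a, terms=40):
    """Matrix exponential by truncated power series (fine for small, modest-norm matrices)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Hypothetical diagonal entries of sum(:H.J:) in the eigenbasis b.
diag = [0.3, -1.2]
in_b = [[diag[0], 0.0], [0.0, diag[1]]]

# The same linear map written in a rotated orthonormal basis.
theta = 0.4
c, s = math.cos(theta), math.sin(theta)
rotated = matmul(matmul([[c, -s], [s, c]], in_b), [[c, s], [-s, c]])

Y_from_b = math.log(sum(math.exp(d) for d in diag))  # log of sum of zed
Y_rotated = math.log(trace(expm(rotated)))           # log(trace(Z)) in the other basis
```

In the rotated basis the matrix is no longer diagonal, so its entries are not zed; the trace of its exponential nonetheless agrees.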
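It is interesting, now, to look at the final value of the function whose maximum we've found. That's
f(r) = sum(r.log(1/r)) = sum(: r(i).(Y(h) − sum(: H(X).Ob(X,i) ←X |J)) ←i |dim) = Y(h) − h(Val) = log(trace(Z)) − h(Val)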
which is manifestly independent (because Y and Z are) of our choice of simultaneous eigenbasis for our observables, even though f was defined in terms of that basis: the final value depends only on the observables and their values. Furthermore, h(Val) = h·Val = h·Y'(h), so Y(h) −h(Val) is just the estimate we would make, by linear extrapolation from Y's value and derivative at h, of the value of Y(zero).
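The relation between the maximal entropy, Y and h(Val) can be confirmed on a small example. This Python sketch assumes r(i) = zed(H,i)/exp(Y(h)) and Val(X) = sum(r.Ob(X)) as above; the observables and multipliers are hypothetical.

```python
import math

# Hypothetical diagonal entries Ob(X, i) and multipliers H(X) for X in J.
Ob = {"A": [0.0, 1.0, 2.0, 3.0], "B": [1.0, 0.0, 1.0, 0.0]}
H = {"A": -0.7, "B": 0.3}

# Unnormalised weights, their log-sum Y, and the normalised distribution r.
z = [math.exp(sum(H[X] * Ob[X][i] for X in Ob)) for i in range(4)]
Y = math.log(sum(z))
r = [w / sum(z) for w in z]

# Values of the observables in the entropy-maximising distribution.
Val = {X: sum(p * o for p, o in zip(r, Ob[X])) for X in Ob}

entropy = sum(p * math.log(1.0 / p) for p in r)
h_of_Val = sum(H[X] * Val[X] for X in Ob)   # h(Val) = sum(:H.Val:J)
gap = abs(entropy - (Y - h_of_Val))
```

Here `gap` should vanish up to rounding, since log(1/r(i)) is Y minus the H-weighted sum of diagonal entries and the r-weighted average of that sum is h(Val).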
Thus far the only time I've needed many of anything is hidden inside how I assess the randomness of the function r. The derivation of the measure of randomness, as sum(r.log(1/r)), involved making up r of a large number of dollops. Its validity may involve more measure theory than I know, but the dollops of which it needs large numbers are purely a fiction of the construction: they make no pretence at having anything at all to do with anything physical. Any distribution which arises as a result given one number of dollops also arises for at least some larger number of dollops (e.g. any multiple of the original number); indeed, the distributions made of many dollops are taken to be dense among the possible distributions. Since, for these distributions at least, we know the probability will depend on the value of r only via sum(r.log(1/r)), we infer that any probability distribution we manage to define on the space of (: distributions :{mappings ({scalars}:|dim)}) will depend on position, r, in this space only via the same sum: which may be evaluated without reference to how many dollops were involved. So the only large number needed thus far was used in an abstract context to justify using a measure of randomness which can be evaluated without reference to the large number in question.
Thus the conclusions derived here apply to any quantum system, even if it consists only of one molecule or, indeed, a single atom. In so far as its internal dynamics permit it various states with its observed properties, we can expect it to be in an entropy-maximising superposition, of these states, of the kind described here.
Suppose we add a new observable, M, to J, but that it's entirely determined by the other observables, i.e. it's expressible as a function of the observables already in J. Intuitively, this shouldn't make any difference; so let's see what the theory predicts.
We have Val(X) = ∂Y/∂H(X) = ∂(log(sum(: exp(sum(H.K(i))) ←i :dim)))/∂H(X) and we're adding an extra term to the sum(H.K(i)), H(M).K(i,M). Our partial differentiation formally treats H(M) as being independent of the other H(X), with X in J, so this won't change Val(X) for X in J; but shall imply a value for Val(M) which we should be able to infer from the other Val(X). Each K(i) gets a new component, K(i, M), which can be expressed as M's function of the other K(i, X) with X in J; we get H(M) as multiplier for this new component.
What's shown above is that:
Notes:

Written by Eddy.