Cause vs. Statistics
Here are two fundamental and incommensurable ways to approach understanding, forecasting, judgment, decision-making, and expectation:
1. Explanations, natural laws, stories, narratives, descriptions, anecdotes
vs.
2. Statistical probability
This is just starting to coalesce in my mind. It's still hard to say how these two relate to each other. I can say that although the second can be far more accurate if its pitfalls are avoided, humans are very bad at recognizing, understanding, using, and interpreting its implications, for Darwinian reasons I imagine. The upper layers of the human mind (the stream of consciousness) invest heavily in 1 and can access 2 only with great effort and only in a heavily mediated way. My interest in the dichotomy is intensified in light of my new commitment to a Fourier-like view of the map-territory relationship -- all explanations are Ptolemaic. Accepting a statistical understanding of reality that trumps a narrative understanding represents a way forward. The way I've laid out this dichotomy also obviously reflects Kahneman's System 1 and System 2, which I've been reading about lately. But I'm not pitting intuitive bias against logical reasoning; both of Kahneman's categories lie firmly in my 1 above. 2 is weird alien territory that people hardly knew about until the 18th century (Law of Large Numbers, Central Limit Theorem, Bayesian inference).
The above dichotomy also reflects the direction of the history of physics from the classical view to the quantum view. A heavy reliance on 2 over 1 is very close to the Copenhagen Interpretation. To some extent, this is deductive reasoning (1) vs. inductive reasoning (2), but that doesn't completely characterize it. Inductive science, it seems to me, has been a means of choosing between various narratives rather than a way to transcend narrative completely.
I heard a perfect illustration of that on an old Radio Lab episode today (3/13/2022). Apparently some AI program can do amazing things with finding regularities in masses of data. It "discovered" Newton's F=ma without possessing any rules of physics; it just had access to massive data about the behavior of a chaotic pendulum contraption. This program, the story continued, was also able to describe in a few equations how a particular complex biological system would behave in terms of changes in chemical processes. I'll have to listen again to get the deets. The part that struck me was that the biological researchers were able to verify that the equations produced by the program were correct, but they didn't think they could publish their findings because they didn't understand them. They had nothing resembling an explanation for why the equations made such accurate descriptions. Well, by gosh, that sums up perfectly my ideas on the limitations of "knowledge" as we know it. No narrative means no knowledge? Not at all! The correct equations are an end in themselves -- a statistical truth. Is it at all possible there exists no explanation for why these equations work? I think yes. Pretty much everyone else in the knowledge industry says no. They would say that we aren't yet smart enough to explain it. But, again, I think this is a misunderstanding of the basic limitations of beings whose access to the world is mediated by a rubbery bubble that does its best to cover its own tracks and convince us it isn't there at all. Explanation is a human activity and probably an activity of any other sentient creatures in the universe (deriving from a Bubble and Beacon view), but it isn't a fundamental aspect of the territory itself. Explanation is Ptolemaic. Is there something unclear about what I'm trying to get across? Yes! Ugh! Explanations and narratives are always arbitrary fairy stories -- even my strained pleadings now. The only things we know for sure to have a real existence are the underlying regularities.
"About what percentage of cases like this are resolved in favor of the plaintiff?" "Don't be silly; every case is unique." Both the statistical perspective and the individual-cases perspective are legit, but they are thoroughly at odds. The probabilistic approach is clearly right in the long run, but maybe not for individual cases. Still, you have to think that the current case is part of that long run. Isn't that our best evidence? There ought to be a quick way to signal which of the two ways -- individual cases or longterm regularity -- one is talking about. They can't be mixed -- like theories derived from my switched assumption thing. Precisely that. This is a prime assumption switch that we are already using in scientific discourse.
"About what percentage of coins turn up heads?" "Don't be silly. It all depends on what happens on the particular coin flip." Individual cases can often depend too much on complexity, randomness, and the vagaries of narrative bias to be of much use.
The key to making perspective 2 worthwhile in the above two cases is that there is a degree of uncertainty. But isn't there always? At some threshold of randomness, perspective 2 becomes vastly superior.
This should turn into a whole big thing for me. I wonder if I can sort it out. Examples!
Thinking about this dichotomy proves to me how fundamental explanation (qua ex-plaining) really is to the human mind. The bubble-beacon perspective is built around smoothing disturbances. #1 accomplishes that goal, but #2 doesn't. Thus, statistics, while superior, are less satisfying. That's really important. Stats don't ex-plain. They forecast without explaining. Maybe they meta-explain. Thus, a statistical approach to knowledge is simply not for the world of humans. Science, at its best (IMHO), avoids explanation and refers only to demonstrated regularities and correlations, statistical analyses of experimental results. Policy- and decision-makers may have to take action based on those results, but that is outside the purview of science. Humans need and respond to stories (because they allow us to fend off, understand, assimilate, spread influence). This is a deep divide that I'm not quite able to express, but humans and other natural systems in the bubble-beacon framework are intrinsically, ontologically committed to 1 over 2.
________________________
A digression:
Tensegrity is a name coined by the redoubtable Buckminster Fuller to describe self-supporting structures that possess precious little physical structure. A simple tensegrity column may include just three relatively massive rigid struts of equal length and nine relatively wispy wires. The struts aren't in direct contact with each other. Instead, each of the six strut ends is attached by wires under appropriate tension to three of the other ends. Counterintuitively, the thing stands up spring-like and can even bear quite a bit of weight, depending on the tensile strength of the wire and the compression/torsion strength (?) of the strut. That is, a tensegrity column can replace a traditional massive column whose whole job is to bear masses aloft. A quick search will yield the images you need to understand this. The only weakness with a tensegrity column as a construction concept is that if you snip one wire, the whole thing collapses. Its hierarchical, interdependent nature makes it too untrustworthy; it's easy to sabotage.
I brought tensegrity up because of an unusual recursive property it exemplifies. Since the completed structure is essentially a column, it can itself fill the role of a rigid strut, so it's possible in principle to replace each of the three struts with a smaller version of the original tensegrity column. (I don't recall where I came across this idea.) You can probably see where I'm going with this. There's no reason you can't keep replacing more massive struts with less massive tensegrity columns until there's almost no rigid structure left at all; merely wires under tension miraculously carrying a load, and vanishingly little of a recognizable kind of mass or structure.
For some reason, this has become my go-to image for replacing the cumbersome mechanism of explanation with the wispy elegance and recursiveness of probability. The explanation gets the job done but with a lot of expensive mass that a probabilistic approach doesn't need. Standard column structure goes with linear reasoning and tensegrity goes with unanalyzably complex interdependence -- like the world. Non-narrative recursion sometimes allows us to get rid of the massive machinery.
Continuing to digress:
Here's a sort of example that will take some time to set up. A computer program that plays a winning strategy of tic-tac-toe could just be a long series of if-then statements or perhaps a couple of clever procedures. In either case, in order to write the program the programmer would have to understand a winning strategy and be able to explain it. That is, the program is all about explanation -- i.e. the programmer's explanation to the computer of what to do to win. On the other hand, you could write a very simple program (equipped with a decent pseudo-random number generator) that experimentally determines a winning strategy with only a knowledge of the rules and no idea about strategy. The often unrealistic condition that holds for tic-tac-toe is that the number of game states is small. Assign to each possible game position a random number that will eventually represent the desirability of that position, and start to play. In one round of game play, starting from the current position, the program determines all of the (nine or fewer) positions it could get to with its next move, and chooses the one with the highest number (strength) assigned to it. If the program wins its game, the number in each used position is increased by say 1% (multiplied by 1.01). If it loses, the number of each used position is divided by 1.01. Tie games have no effect. This is a loose sort of Bayesian inference model. Due to mathematical attraction, after a couple of million plays (i.e. in a few minutes), the program will zero in on an expert ability to play tic-tac-toe, which will mostly consist of a set of numbers assigned to a set of positions. Something rather like this technique was demonstrated in the early 1960s (Donald Michie's MENACE) using matchboxes filled with variable numbers of beads for each position (rather than a computer at all). I have come to call this method of reinforced probabilities the Matchbox Method. The result surely is wispy and nonlinear from the point of view of the programmer. The technique can be applied to almost any game or real-world situation where you have large piles of data (like millions of played games or maybe the text of millions of Wikipedia articles) and some degree of feedback (like winning and losing or some other success-failure criterion). And the point is that the solution is entirely explanation-free. There is nothing in the program that understands tic-tac-toe in the sense that a human does (or seems to) and no way to see an explanation of why it works from an inspection of the boxes. The numbers might well suggest an explanation, but no such thing will be inherent in those numbers. Cut out the middleman for huge savings!
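Since the matchbox method is really just an algorithm, here is a minimal sketch of what it might look like in Python. The board encoding, the helper names, and the greedy self-play loop are illustrative choices of mine; only the multiply-or-divide-by-1.01 update and the couple-of-million-games figure come from the description above.

    import random
    from collections import defaultdict

    # The eight winning lines of the 3x3 board.
    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(board):
        """Return 'X' or 'O' for a win, 'draw' for a full board, else None."""
        for a, b, c in LINES:
            if board[a] != '-' and board[a] == board[b] == board[c]:
                return board[a]
        return 'draw' if '-' not in board else None

    # Every position starts with an arbitrary "desirability" number.
    value = defaultdict(random.random)

    def next_positions(board, player):
        """All positions reachable with one move by `player`."""
        return [board[:i] + player + board[i+1:]
                for i, c in enumerate(board) if c == '-']

    def play_one_game():
        board, player = '-' * 9, 'X'
        used = {'X': [], 'O': []}
        while winner(board) is None:
            # Greedy choice: the reachable position with the highest number.
            board = max(next_positions(board, player), key=lambda b: value[b])
            used[player].append(board)
            player = 'O' if player == 'X' else 'X'
        result = winner(board)
        if result != 'draw':                  # ties leave the numbers alone
            loser = 'O' if result == 'X' else 'X'
            for b in used[result]:
                value[b] *= 1.01              # reinforce the winner's positions
            for b in used[loser]:
                value[b] /= 1.01              # weaken the loser's positions

    for _ in range(2_000_000):                # "a couple of million plays"
        play_one_game()

A real run would probably want some exploration mixed into the greedy choice (the original matchboxes chose moves in proportion to their bead counts), but even this bare version makes the point: the finished "player" is nothing but a table of numbers.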
For years, I have been threatening to write a related game-playing program for backgammon, but it quickly starts to sound like a lot of work, with billions or trillions of simulated games played in order to achieve good strategies. The difference from tic-tac-toe and the big challenge is that there are too many possible game positions in backgammon for each one to have its own "matchbox." The matchboxes would fill the universe. In chess, the situation would be even more untenable. The simplifying idea that makes the matchbox method applicable is that one can create a "covering" of several thousand or several million matchbox positions that can represent all of the possible positions, presumably to a measurable degree. Come up with a crude way to measure the degree of fit between any of the myriad actual positions and those few in the covering. Suppose that from your current position you roll a (4,2) and that this roll opens up 23 different new positions for the player to choose from. Take the position resulting from one of the 23 choices and comb through the covering for, say, the six elements that are most "similar" to it -- that is, that yield the highest matching score. Use a combination of the "goodness" values for these six elements (which are at first arbitrary) to calculate a supposed goodness value for the actual position. (The combining technique can start out almost arbitrary too, but I'm confident an increasingly accurate method will evolve.) Now do the same for the other 22 choices, and simply choose the position predicting the highest chance of victory. If a game results in victory, all of the choices made and recorded in the winner's game log -- even inadvertently bad ones -- are reinforced by increasing the values assigned to the corresponding elements of the covering. Values for the covering elements used by the loser are decreased. In many cases, a single game will increase and decrease the value of a particular covering element several times. The values will wander, but ultimately settle down (like the value of pi calculated by Buffon's Needle). The key is billions and billions of simulated games and a faith that a solution state is "attractive." The program can be played against a more traditional program or against itself. The elements in the covering can evolve over time, as can the criteria used to measure fitness to the elements and the way these values are combined. That's at least 4 evolving systems. As with the recursive tensegrity column, elements of logical structure (weight-bearing struts) could successively be replaced with wispier bits of probabilistic structure (tensile wire). In this scenario, the wires are no more vulnerable to sabotage than the struts, I think.
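Here is a sketch of how the covering machinery might look, stripped of all backgammon specifics. Everything concrete in it is an assumption: positions are treated as plain numeric vectors, "degree of fit" is a crude negative squared distance, the six neighbors are combined by a simple average, and the 1.01 reinforcement factor is borrowed from the tic-tac-toe sketch.

    import random

    K = 6                      # "say, the six elements" most similar to a position
    N_COVER = 10_000           # size of the covering (the text imagines thousands to millions)
    DIM = 26                   # e.g. checker counts by board location -- an assumed encoding

    # The covering: arbitrary representative positions, each with an arbitrary
    # starting "goodness" value.
    covering = [tuple(random.randint(0, 3) for _ in range(DIM)) for _ in range(N_COVER)]
    goodness = {elem: random.random() for elem in covering}

    def similarity(pos, elem):
        """A crude degree of fit between an actual position and a covering element."""
        return -sum((a - b) ** 2 for a, b in zip(pos, elem))

    def nearest(pos, k=K):
        """The k covering elements with the highest matching score for `pos`."""
        return sorted(covering, key=lambda e: similarity(pos, e), reverse=True)[:k]

    def estimated_value(pos):
        """Combine the goodness of the k most similar covering elements (plain average here)."""
        neighbors = nearest(pos)
        return sum(goodness[e] for e in neighbors) / len(neighbors)

    def choose_position(candidate_positions, game_log):
        """Among the 23 (or however many) reachable positions, pick the highest-valued one."""
        best = max(candidate_positions, key=estimated_value)
        game_log.extend(nearest(best))        # remember which covering elements we leaned on
        return best

    def reinforce(game_log, won):
        """After a finished game, nudge every covering element that was leaned on."""
        for elem in game_log:
            goodness[elem] = goodness[elem] * 1.01 if won else goodness[elem] / 1.01

The evolving parts the text mentions (the covering elements, the similarity measure, the combining rule) are frozen here as the simplest possible stand-ins; the point is only the shape of the machinery.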
By my very limited historical understanding, researchers in artificial intelligence and/or expert systems spent decades mostly focused on two different lines of inquiry rather than the simple thing outlined above for games: 1) Neural nets that tried to mimic human abilities by mimicking the structure of the brain at a level below conscious explanation, and 2) Programs that mimicked human thought based on analyses of higher-level thinking functions (like the tic-tac-toe program that's a long list of if-then statements or clever procedures). The first of these was in essence probability-reinforcement-based, but it couldn't exploit the full power of a digital computer: there was, for example, no way to find where the acquired knowledge resided so that it could be manipulated and experimented with, and no direct way for the knowledge to be communicated or transferred to other machines. Neural nets are too wispy and interdependent!?
Of course, neural nets are still useful because they can be quite good at pattern recognition in a way that the matchbox method can't match.
The second approach suffered from an overestimate of our ability to correctly analyze thought, an underestimate of how complicated such explanations can become, and the limited range of applicability of any one explanation. It was only with the data explosions in the age of the internet that the third way, the matchbox method, could be fully exploited. If I understand it correctly, IBM's Watson and many of the new autonomous driving systems are of an explanation-free, probability-based sort. Matchbox computation seems to be the way forward.
It is entirely debatable whether such systems do in fact mimic human minds at some deep level, but it's clear that at the highest behavioral levels, human decision-making and understanding are about narratives, stories, unquestioned facts, protagonists, sympathies, rules of thumb, descriptions, explanations, expressions, and visceral preferences and prejudices. And not about holistic and rational assessments of probability. Probabilities are only exploited by people weakly through (neural nettish) intuition or even more weakly through slow and difficult analysis.
It's a fascinating mental challenge to try to reject the narrative/explanation nature of reality in an intellectual way. Impossible, I think. Tied to language as we are, is there so much as a sentence we can form that doesn't rest on an assumption of narrative legitimacy? I can't think of one at the moment. If it is impossible to reject narrative, the question becomes "How are we to regard our explanations as secondary to factual/statistical reasoning -- the alternative view of reality, induction versus deduction -- and have the whole thing hold together?"
__________________________________________
somewhat related musings on autonomous vehicles
I know no details of how self-driving systems work beyond what I can gather from articles in the popular press, etc. So my thoughts here are pretty much 100% conjecture and thus totally wrong. I will proceed anyway. (Further study indeed indicates a lack of match between real self-driving systems and what I lay out below.)
Self-driving car systems have a huge number of inputs, but essentially only two output recommendations: a speed change (usually zero) and a direction change (usually close to zero).
I picture the programming for these systems as closely analogous to the description given above for backgammon. A "covering" (or representative sample) of input states would have to number in the millions, but there's probably a way to subdivide the millions into 1000s of clusters of 1000s of representatives for faster evaluation. You read an input state. It includes map data, visual data, sonar, lidar, the works. It includes how fast each nearby vehicle is going and how much each is able to maneuver in case of an emergency. Perhaps 50 million bits of information. Score that set of data against the 1000+ cluster reps. Find the, say, three most pertinent, highest-scoring clusters. Now scour the 1000 elements of each of these clusters for the three most pertinent representatives there, and the final computation is some kind of linear combo of the decisions for the (3 x 3 =) nine reps, with coefficients determined by the various matching scores. The nine representatives and 4000 scorings are of course phony numbers. The actual numbers will be the largest that can be processed in the allowed time -- maybe .05 seconds. These numbers themselves could be subject to evolution. Somehow, the most cautious (or extreme) recommendations (hard deceleration, hard turn to the right) must have greater weight. A hard right and a hard left shouldn't cancel out but must be chosen among. One would hope that in 99% of cases the nine recommendations would agree, and a lot of that time the unanimous choice would be zero turn and zero acceleration.
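To make the two-level lookup concrete, here's a toy version. The numbers are scaled way down (100 clusters of 100 representatives instead of 1000s of 1000s), the sensed state is reduced to a plain feature vector, the matching score is a bare dot product, and the extra weight on extreme maneuvers is one arbitrary way to keep a lone hard-brake recommendation from being averaged away; none of this is meant as a claim about how real systems work.

    import random

    N_CLUSTERS, PER_CLUSTER, DIM = 100, 100, 32   # scaled-down stand-ins for 1000s of 1000s

    def rand_vec():
        return [random.gauss(0, 1) for _ in range(DIM)]

    # Each cluster: a representative vector plus its members, each member carrying
    # an evolving (speed_change, direction_change) recommendation.
    clusters = [{"rep": rand_vec(),
                 "members": [{"vec": rand_vec(),
                              "rec": (random.gauss(0, 1), random.gauss(0, 0.1))}
                             for _ in range(PER_CLUSTER)]}
                for _ in range(N_CLUSTERS)]

    def score(state, vec):
        """Crude matching score between the sensed state and a stored vector."""
        return sum(s * v for s, v in zip(state, vec))

    def recommend(state):
        """Two-level lookup: 3 best clusters, 3 best members each, weighted combo of 9 reps."""
        best_clusters = sorted(clusters, key=lambda c: score(state, c["rep"]),
                               reverse=True)[:3]
        picks = []
        for c in best_clusters:
            picks += sorted(c["members"], key=lambda m: score(state, m["vec"]),
                            reverse=True)[:3]
        # Weight each recommendation by its matching score, with extra weight for
        # extreme speed changes so a lone hard-brake isn't averaged into oblivion.
        # (Choosing between a hard right and a hard left, rather than blending
        # them, is left unresolved here, as it is in the text.)
        total = speed = direction = 0.0
        for m in picks:
            w = max(score(state, m["vec"]), 0.0) * (1.0 + abs(m["rec"][0]))
            total += w
            speed += w * m["rec"][0]
            direction += w * m["rec"][1]
        return (speed / total, direction / total) if total else (0.0, 0.0)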
I envision having 20 independently evolved matchbox programs running on 20 independent processors (tons of cheap special-purpose chips), and each will offer a recommendation and a confidence level to a central organizing system that will determine a final decision based on those recommendations. I say twenty independent matchbox procedures because each matchbox system will tend to evolve independently, learn different lessons from its experience, and have differing sweet spots in its skill set -- like the European model and the American model in weather forecasting. Again, recommendations for extreme maneuvers have to be listened to with heavier emphasis to make sure those deadly corner cases aren't missed -- the stop sign that's hard to identify because it has a sticker on it, or the person in a very creative Halloween costume who doesn't look like anything in the database and is missed by 18 matchboxes and weakly sensed by 2.
If information security issues can be solved, it would be great if the central system also had access to the recommendations of nearby cars -- the wisdom of the crowd would be very beneficial. When the central system is receiving too wide a spread of recommendations from the 20 matchbox programs, or the recommendations all have low confidence levels, the coordinating system should carefully slow the car down until the recommendations from .05 seconds later give a less ambiguous message.
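And a sketch of the central coordinator, under the same caveats: the Recommendation record, the disagreement threshold, the confidence cutoff, and the gentle-slowdown fallback are all invented names and numbers standing in for whatever the evolving system would actually settle on.

    from dataclasses import dataclass
    from statistics import pstdev

    @dataclass
    class Recommendation:
        speed_change: float       # negative = braking (assumed units)
        direction_change: float   # signed steering adjustment (assumed units)
        confidence: float         # 0..1, reported by the recommender

    def coordinate(recs, spread_limit=0.5, min_confidence=0.3):
        """Blend the 20 recommendations; fall back to cautious slowing when they disagree."""
        spread = pstdev(r.speed_change for r in recs)
        if spread > spread_limit or all(r.confidence < min_confidence for r in recs):
            return (-1.0, 0.0)    # gently slow down and wait for the next .05-second cycle
        # Weight by confidence, with extra weight for extreme (cautious) maneuvers
        # so a lone hard-brake recommendation isn't averaged away.
        weights = [r.confidence * (1.0 + abs(r.speed_change)) for r in recs]
        total = sum(weights)
        speed = sum(w * r.speed_change for w, r in zip(weights, recs)) / total
        direction = sum(w * r.direction_change for w, r in zip(weights, recs)) / total
        return (speed, direction)

    # e.g. twenty agreeing recommenders:
    # coordinate([Recommendation(-0.2, 0.01, 0.9) for _ in range(20)])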
I have little doubt that dangerous errors in these autonomous vehicles will ultimately become infrequent and death rates will be cut drastically, especially as fewer human drivers throw their wrenches into the works. But some deadly errors will look absurd to human judgment (like the famous pedestrian-with-bicycle death of 2018). This relates to the premise of this piece: statistical choices are explanation-free and inarguably capable of vast superiority by any given set of criteria, while human choices are statistics-free, explanation-rich, and subject to 1001 unconscious and erroneous biases. Thus, people will overrepresent the random, unexplainable deaths of a few compared to the equally random but perfectly explainable deaths of many -- the driver was drunk or sleepy or distracted by texts. Somehow, a death caused by an inadequate program that missed something no human would miss is more tragic and unacceptable (or at least generates more terror) than one caused by a flawed human driver who missed something no program would have missed. Understandable, I guess.
Anyway, this state of affairs will no doubt slow the adoption of the technology. What will win out in the end are all of these more modest systems -- auto-braking, lane assist, blind-spot warnings, distracted-driver warnings -- gradually convincing people that the systems are superior to the humans.