On Knowledge Representation

[<< | Prev | Index | Next | >>]
Saturday, July 13, 2013

On Knowledge Representation

The task of storing a phone number in an address book seems pretty trivial on the surface, but it has stumped me for years. The issue, of course, is not storing the number itself, but rather representing the context in which it applies. The usual approach is to pigeonhole everyone into the common cases -- work, home, and cell, or maybe a flexible list of tagged numbers. But what if I want my address book program to understand me (enough to give me the right number at the right times) when I say that John's number during August will be 555-1234 between 6 and 9pm except on days that it is raining or if I am calling from out of state?
We humans are pretty good at describing things with natural language, but getting computers to understand it is another matter. Although rarely admitted, the problem is not writing a program that understands natural language, but writing a program that understands anything. That, in turn, is mostly about representing knowledge in the first place--not for communication, but in the mind of the program itself.
This essay is an exploration of that topic, with an eye toward defining a good "machine language" for artificial intelligence, but also just for understanding some of the common knowledge representation paradigms (especially those used in programming languages) in a broader context.
What is Knowledge?
Loosely, knowledge can be divided into two categories: model, and state. A model is a representation of how things work, of what is possible, of what things go together or don't, of what is always and everywhere true. State is a representation of how things are at a particular time and place. In more mathematical terms, a model is a statistical distribution over all possible states, saying which states are likely, which impossible, and so on. Your model of the world is your intuitive sense for how the world works. Your state is what you see, hear, think, feel, at this particular moment, along with the state of the rest of the universe around you.
Correspondingly, your memory comes in (at least) two distinct types: conceptual memory, which over time has learned to model the world--everything from your ability to recognize a tree to your expectation that someone will giggle when tickled--and instance memory, which lets you recall past states, your memory of particular things and times in the past--such as that particular time your friend giggled when tickled.
The representation in our heads is obviously very general. Even from single photograph, we can focus on the color of a dot, the make of the car, or the relative size of the two dogs. Our understanding of the scene goes far beyond just the items in it, but includes their relationships to each other, and to many things beyond the scene itself. If a part of the scene is obscured with a stain, we can easily imagine what might be there. If asked to describe it, we could communicate it fairly concisely, and someone else could imagine an approximation to it in their heads.
So how do we do it, why is it hard for a computer, and how can we fix that?
Let's start by looking at state representation since, as mentioned above, a model can only be understood as a distribution over possible states, so if we can't represent state we would be hard pressed to represent a model.
The State
Perhaps starting as early as childhood algebra, most of us learn to consciously represent state through the analogy of things in containers. "X = 10" means there is a particular thing--the number 10--in the X container. We might imagine ten stones in a cup. We call X a "variable" because it can hold different things. We call the things "values" because the first things in containers were things of trade, and what we cared about was their worth. In algebra, our goal is usually to infer what value the container or variable holds. Later on in an office job we might create forms with spaces to fill in values. In computer programs, we make complex, hierarchical structures of containers, create new kinds of values which are pointers to other containers, store collections of value in databases, and so on--the paradigm permeates computer science. The analogy is so ubiquitous that if ever it falls short of some task, we think in terms of augmenting or adding to it rather than starting over with something else.
But even in that childhood algebra, we also saw things like "X < 10". Working with our existing analogy, we learn that this constrains what might be in X. Everything seems fine, the analogy sound, except for one thing we take for granted: This new knowledge we have about X doesn't fit in X. Before we have any particular value of X, we have this fact "X < 10" that X itself (a container of integers) can't hold. If you wanted to represent five using a cup, you can put five stones in it. If you wanted to represent it using a field on a form, you can write "5" in the field. But how do you represent "< 10"? You might be tempted to write "< 10" on the cup, or in the form field, but what if we also learn it is prime, non-negative, and the median lifespan of a Metaturian norgblat (which we don't happen to know offhand)? While "less-than" itself may be a concept, "X < 10" is clearly a fact--it tells us something about a particular state (of which X is a part), and yet our fundamental representation of facts--things in containers--doesn't gracefully handle it.
I say "gracefully" because clearly we can contrive ways to do it, and we do: We can create a new kind of container which holds representations of facts about other containers. So that the whole expression "X < 10" itself becomes, in one form or another, a "value" that happens to tell us something about another value. We write it on the side of the container or in a footnote under a form. On one hand, this is pretty nifty, because going meta like this allows us to write facts about facts and that can be very powerful. On the other hand, yuck! Here we have the simplest notion, something we learned as children, and our ubiquitous knowledge representation cannot encode it directly. We can store 10 in X, but in order to store "X < 10" we need to create another container, X', which holds statements about X, and then we need to know to look in X' when we want to know about X... Now, to be fair, mathematical notation puts "X = 10" and "X < 10" on equal footing, but I will get back to that. It is worth note that none of the most popular computer programming languages allow you to pose "X < 10" as a statement. They all allow it as a question: once X contains a particular value, you can inquire whether that value is less than ten. But you cannot simply say that you know X is less than ten! Any non-programmers reading this essay may find this as a surprise--a point which any programmers reading this essay should reflect on.
It is tempting here to say that "X < 10" is a rule, and belongs in the model, not the state. But while saying that "a 9-stone cup can only hold 9 stones" may be a rule, to observe that "the stones are in a 9-stone cup and so there must be less than ten of them" is not a rule but rather an inferred fact about the current state. More generally: most of the knowledge we have about state comes from observation or inference, both of which are usually incomplete. If we read our weight as seventy kilos, we are not observing "X = 70" but more typically something like "~69.5 < X < ~70.5". Even the assumption that at some resolution there is a true and definite state of the universe is challenged by the peculiarities of quantum physics. But whether there is or isn't, our knowledge about any particular variable X is often something other than the value it contains, but rather the various ways it relates to other variables. So how can we represent that knowledge?
Our childhood algebra seemed to handle "X < 10" just fine, so what's going on there? Zooming out a bit, that mathematical notation is really just an alternate syntax for predicate notation. In predicate notation, we might write "X < Y" as LessThan(X, Y), "X = Y" as Equals(X, Y) and so on. An equation like "X + Y < Z" has a hidden variable for the sum of X and Y: Sum(X, Y, A), LessThan(A, Z). In this form, suddenly the situation is reversed: all of the information is now outside our variables instead of inside. Rather than containing a value, they are contained by, or participate in, various predicated relationships with other variables. Or so it seems until we hit that pesky constant, 10.
When we write LessThan(X, 10) it looks like our variables are value holders after all. Correspondingly, many predicate systems, such a Prolog, are rooted in variables and values (things in containers). For instance, to say in Prolog that Mary is 25 years old, you might say AgeOf(mary, 25). Here, just like the constant 25, mary is a unique constant value (referred to in Prolog as an atom) which forever refers to Mary, such that we might elsewhere say OlderThan(mary, john) to mean Mary is older than John, and so forth.
The problem with this is subtle but critical: When we create relations directly over eternal values (constants like the atom mary), we lose the ability to contextualize that knowledge. For instance, EmailOf(john, jd@Xcorp.com) might mean "John's email is jd@Xcorp.com". How then do we say "John's email in sales at Xcorp is sales@Xcorp.com, his email there as tech support is jd@Xcorp.com, and his email at Zcorp is jd@Zcorp.com." The usual answers are all painful. The least painful is the relational database approach, but for reasons I will get to that is only half an answer.
The solution is to abandon the values-in-variables paradigm completely and move to a truly pure predicate representation. This means moving any constants outside of the parameter list so that, for instance, AgeOf(mary, 25) becomes Mary(M), 25(V), AgeOf(M, V). Here M and V are no longer variables that hold things, but rather pure interfaces that do nothing more than participate in relations. The concept of type is not lost: V can be seen as an integer interface, able to participate anywhere any integer interface can. But in this representation it is not a value holder, per se, since all manner of relational facts about it are on equal footing, and all merely reference it without looking inside. Consider, for instance, "Mary is older than John": Mary(M), John(J), AgeOf(M, V), AgeOf(J, U), Greater(V, U). Here our only knowledge of V is AgeOf(M, V) and Greater(V, U). Our representation is not storing things in V.
The proliferation of predicates appears problematic, but note that Mary is just a new constant same as mary was in the Prolog example. Except here, that constant is entered into the space of predicates--which is already a set of eternal constants. The predicate 21 likewise can be handled behind the scenes in essentially the same way as Prolog handles its integers: with a binary representation. Here we might use a standard set of bit predicates: Bit0(V), !Bit1(V), Bit2(V), ... (where !Bit1(V) means we know Bit1(V) is not true; we can likewise say !Less(V, U) and so on). Again this maintains predicates as eternal constants and parameters as transient interfaces--always and with no exceptions.
Consider for a moment if the state knowledge we are trying to represent is the state of a running C program. We might have two fields in a union which refer to the same four bytes of memory--one an integer, the other a short string, for instance. The underlying interface for both is the same array of bytes, yet the two fields present distinct interfaces to those bytes which "behave" as integer and string. The union example makes a distinct illustration, but the same thing is true of all of our program variables. In effect, our variables are not themselves value holders, but merely interfaces to interfaces to interfaces to the bits in the memory chips or CPU registers which finally do hold ones and zeroes. When we (humans) represent knowledge about these variables it is relational information about them as interfaces, not values stored in containers. Likewise, even an optimizing compiler thinks about variables primarily in relational terms, and may draw conclusions and generate code that relies entirely on relative truths without ever determining a "value" of some variable (which may in turn not even be explicitly represented in the final program anywhere).
Back to our earlier problem, this all helps us because now our interfaces are themselves distinct and reliable contexts. Whereas before, we implicitly anchored on the identities of the values of the variables (note that even most predicate systems "unify" two variables into one when they are determined to hold the same value), here we anchor on the identities of the interfaces themselves. Instead of EmailOf(john, jd@Xcorp.com), we have John(J), EmailOf(J, E), "jd@Xcorp.com"(E). This allows us to complete the context of J without impacting other instances of John: Company(J, C), Xcorp(C), Role(J, S), Sales(S). Note that we can contextualize other interfaces besides J here, such as Country(C, I), Ireland(I), to let us know we are talking about the Irish portion of Xcorp only. For our next fact about John, we create a new interface, K: John(K), Role(K, T), TechSupport(T), and so on.
Equality in this representation becomes just a relationship like any other. That two complex number interfaces A and B are equal just means they represent the same complex number. It does not mean they are the same interface: One of them may be backed by a polar representation while the other Cartesian, for instance, which tells us something about their relative precision in different regions and so forth. We can simultaneously know they are equal as complex numbers and yet different as complex number interfaces, and we can express all of this in one common, simple representation: Equal(A, B), Cartesian(A), Polar(B).
Relational databases, it happens, effect a similar organization if we view the tables as types and the (often implicit) unique row IDs as interfaces:

Row Person Company Role Email

J john Xcorp sales sales@Xcorp.com

K john Xcorp techsupport jd@Xcorp.com

L john Zcorp cook fud@Zcorp.com

M mary Zcorp assassin goodtimes@Zcorp.com

N rob Xcorp sales sales@Xcorp.com

It is not an identical representation, but it illustrates the structural similarity of the relational database approach to the pure predicate representation.
The predicate notation dates back to roughly forever, but for whatever reasons it is usually analyzed and implemented in terms of sets of atoms (conceptually held as values by variables) rather than as pure relations amongst massless interfaces. It is a subtle distinction but one that carries with it unfortunate baggage--particularly around notions of identity and equality.
So what do we mean when we say a true or false predicated relation "means" something?
The Model
The model encodes how one part of the state relates to another, and how state evolves over time. (If we choose to view the state as being replicated or unrolled over time, then the two are the same: one part of the unrolled state represents one time, and a different part represents another.) While the state may encode that X is 1, the model encodes that 1 + 1 is 2. And by encoding how states relate to themselves in this way, the model also dictates what states are possible or impossible, probable or improbable. It also allows us to infer one part of the state from another, such as the future from the past. In effect the model encodes what any one part of the state means to another part of the state. If that model is our entire world model, then we ourselves are part of that state and so the model encodes what things mean to us.
The quality of a model can be measured in how well it reflects the true probability distribution of the states. In the simplest case, we don't want it to allow impossible states, or disallow possible ones.
The first hurdle of building a quality model is finding a good state representation for it to work with. Hopefully the section on state above has given some sense for why the usual (value-based) representations are a serious handicap in that regard.
The second hurdle is finding a good model representation that easily expresses relationships within the state.
Since the value-based state representations used in most programming languages do not inherently express any direct relationships within the state whatsoever, their associated modeling languages must. In the usual case this is done via functions or procedures, which are in effect predicates in a supplementary state representation--the program's dynamic calling stack and program counter, which can select or bypass predicated relations via conditionals. Having this implicit supplementary state is problematic for many reasons: It is a distinct state representation from the primary one (conditional code vs. stored values); it is implicit, which means there is no direct access to it (e.g., a function doesn't know its parents in the calling chain); and it is severely constrained by the temporal details of its implementation (sibling functions are evaluated in sequence rather than being simultaneously active). Code-as-data functional languages like Scheme improve on aspects of this, but don't ultimately escape the constraints of their value-based representations.
The predicate based state representation, in contrast, already expresses relationships within the state, so the task left to our modeling language is merely to express relationships amongst the relationships--whose states in turn are only true or false.
A typical example would be the universal truth: Less(A, B), Less(B, C) -> Less(A, C). Here "->" means "implies". We could directly support "implies" and similar rules in our model, but there is a simpler and more general solution which has many advantages: Similar in spirit to the relational database, we can make a table with columns representing relations, and rows enumerating the possibilities with cell values limited to true (T), false (F), and unknown (? -- which can also be seen as a shorthand for "either T or F"):

Less(A, B) Less(B, C) Less(A, C)

T T T

T F ?

F T ?

F F F

This table fully captures Less(A, B), Less(B, C) -> Less(A, C). But note that it also captures !Less(A, B), !Less(B, C) -> !Less(A, C), which would normally require another rule.
Believe it or not, because the state representation is so general, this is all we need to model just about anything. We could fluff it up with syntactic sugar (adding infix notation, compiling rules into tables, constants into states, and so on--see Programming in Syn) but let's explore the properties of the raw representation.
Model Structure
Note that the above table is nothing more than a recording of possible states. It can be populated by simply picking any three interfaces related by Less() and observing the true/false states of those relations over time. This is true of any such table--once the column names (relations) are assigned, we can pick any random set of interfaces that mutually participate in those relations and simply observe them over time, naively recording everything we see. If we want probabilities instead of just possibilities, we can keep counts of repeated rows. And since the resulting tables are empirical, it is safe to create tables from any random collection of relations without fear of introducing bugs or false information--in the worst case, the table fully populates (allowing all combinations) in which case it has no impact and can be thrown out. Many of the common algebraic rules and identities can be automatically inferred by this technique just by letting the system watch numbers go by.
So, this simple system can learn conceptual relations by example.
Unlike a typical relational database, the columns have globally defined meaning, which means tables can be combined automatically. For any finite set of interfaces, we can pre-compile a finite set of interconnected tables over those interfaces, where tables are joined by their column names (with substituted interfaces). So, for instance, we might have a table:

PlusOne(X, Y) Less(X, Y)

T T

F ?

which can be joined with the previous table by substituting each one of <A,B>, <B,C>, or <A,C> for <X,Y>, meaning three instances of this table would exist in our pre-compiled structure over just A, B, and C. (Note by "joined" I do not mean that a single larger table is being created--although that can be done if we have no concern for space--but rather that instances of the tables are simply connected together by their like column relations.)
Once a finite number of interfaces is selected, and the tables linked together accordingly, and once the tables are connected to their corresponding relations in the state (there is exactly one Less(X, Y) relation in the current state, and it would be "wired" to exactly every Less(X, Y) column in the various tables)--once that is done, the interfaces (nee variables) go away entirely. Far from where we started where variables were the containers of value, in this representation they are not directly represented at all. The state of the system is held in the True/False/Unknown state of the relations only. There would be one ternary bit for Less(X, Y), and that bit would be referred to (read and write) by all the tables with the corresponding column. There may also be a Bit0(X), Bit1(X) and so on, each with its own ternary value, and there is where one would point for the traditional "store" of the value of X. But note that under some circumstances those may be absent entirely (as when X is being reasoned about only symbolically), and so one would often be hard-pressed to find "X" in the resulting schematic.
Application and Inference
The "execution" of inference in such a system amounts to applying the tables to the column values: for all known (T or F, but not ?) column states, sub-select all compatible rows, and see what those rows jointly imply about the remaining columns in the table. For many cases of inference, this will propagate implications around the network of relations and fairly quickly resolve to a determined and consistent state. If ever an inconsistent state is found, it means that either the inputs (the relations that were seeded with T or F) were inconsistently set, or that one or more of the tables are incomplete. (Open question: how to determine which table is the likely culprit in that case, and automatically extend it with the observed state.) More often, the state will not resolve because of uncertainty (no firm implications can be drawn), in which case the statistical information (row counts) can be used to choose most likely implication in an otherwise non-deterministic table, and attempt to resolve the remaining state from there. If that fails (inconsistency), then that hypothesis can be reversed, and a second-best choice made, and so on. I.e., standard search methods can presumably be applied here (for example, the kindred spirit Alloy employs SAT solvers), with the added advantage of having strong statistical hints to bias search order.
So far I have only talked about the static structure compiled over a finite number of interfaces. It is equally possible to link two tables--or two compiled collections of tables--temporally via a pivot of interfaces. Most interestingly, we can compile our entire collection of tables together (call that a meta-table), and then link that meta-table to itself via a finite number of pre-determined pivots. A pivot here amounts to a change of interfaces, including dropping some and introducing new ones. Consistent with the earlier observation that the interfaces themselves aren't actually represented anywhere, the net effect of a pivot is to connect relations from one instance of the meta-table to like relations of the other instance of the meta-table, where the exact connections depend on how the interfaces align between the two tables. This in turn, in practice, amounts to simply permuting the relation states (limited to within each predicate, so that, for instance, the various instances of Less() in the meta-table may swap around their current T/F/? values with each other), possibly archiving some for retrieval when and if the pivot is reversed. This pivot amounts to a change in focus from one set of interfaces to another, and through the above machinations any inferences that can possibly be carried through will be. Combined with the search notion mentioned above, this potentially results in a very robust reasoning/inference/perception engine. Note that the pivot choice in particular is very benign in the sense that it alone cannot result in disinformation--simply pivoting around at random is much like shining a light randomly about a room, and just allows more information and inferences to accumulate. Pivoting is exactly analogous to simply extending the pre-compiled structure to a larger number of interfaces, but on the fly rather than pre-compiled, thus multiplexing a finite amount of hardware into a model of effectively infinite extent. This simple equivalence should facilitate extending the search heuristics into the pivot-extended model.
Another point of note is that the meta-table can easily be split between multiple CPUs or computers, joined by just the ternary bits of common relations. Likewise for new pivoted instances of the meta-table (at which point it ceases to be a pivot, and becomes a dynamically expanded meta-meta-table interconnected by relation permutations).
As a graphical model, each relation, such as Less(X, Y), would be represented by a single boolean node, and each table would be represented by a node with one state for each row in the table. Table nodes would connect to the relations in their column names, and relations to all such tables that name them. Thus relations would be connected to each other only via common tables, and tables to each other only via common relations. To the degree the model naturally partitions into multiple levels of abstraction, it would likely fall into alternating layers of boolean (relation) nodes and N-state (table) nodes, with no lateral connections within the layers.
Although the raw search space for drawing inferences in many cases might seem impractically large, the addition of more abstracted representations can facilitate faster resolution. Finding your way to the supermarket needn't involve considering every twist and turn you might make in your living room starting from your sofa -- heading toward the most door-like thing you can see makes a good first approximation. Over-representation isn't (significantly) penalized, so adding useful abstractions (over which associative tables can be learned from experience if not coded by hand) can only speed search. That is, the difficulty of resolving state is not monotonically related to the model size: many additions to a model provide the proverbial short cut and make things easier.
While sanity prohibits making poorly informed postulates about analogies to neural anatomy, I never claimed sanity so I'll do it anyway: One could imagine, just for instance, that the excitatory cells in cortical columns are analogous to rows in these tables, each cell representing an example of co-activation of relations. The relation (table column) states, in turn, would be represented by long-range outputs of the cortex, largely mapping back to thalamus, where they are fed back to cortex for same-frame search via topographical projections, and for pivoted search via various (seemingly) non-topographical projections. I.e., the thalamus could serve as the pivot-implementing focus multiplexer, while the cortex implements the fixed meta-table. The common case of mutual exclusivity amongst a set of relations (a fat table in the general case) could be special-cased via laterally-connected inhibitory neurons. Per the search light analogy mentioned earlier, the sequence of pivots here is not critical. As long as the permutations are consistent (which is the default in a hardware implementation), the pivots (shifts in focus) simply expand the effective scope of the model via hardware multiplexing, meaning at worst such pivots are uninformative but they can never introduce false information. Learning to problem solve here then becomes a matter of learning efficient pivot sequences within a domain where it is safe to randomly wander, which is very much like learning to navigate in a physical space. (Every pivot is reversible, and is effectively a "move" in a conceptual direction.)
But I digress...
Meta Representation
With natural language we say many things that map easily to state, such as "your keys are on the desk". But we can also convey rules, like "if your keys aren't on the desk, they're in the drawer." Or, to intentionally confuse matters with the earlier example, we might learn in algebra class "if A < B and B < C then A < C". You might assume from the above that such rules must be represented in our model, but in fact our model memory is strictly intuitive. When you hear that John is older than Mary and Mary is older than Jack and you "just know" without having to think about it that John is older than Jack, it's a fair sign that that rule has been learned by your model. But when you learn the rule explicitly as above, it first becomes part of your state--mere factual memory, and not intuition.
You are nonetheless able to apply such rules because your model does have a learned and intuitive rule that says: If "if A then B" and A then B. This learned rule lets you translate from a symbolic representation to an intuitive meaning. In effect, your intuitive model has learned to be a symbolic language interpreter.
Natural language as well as symbolic and meta reasoning in general work this way. We learn to map concepts back and forth between the symbolic and intuitive representations, one by one: If [A "is above" B] then Above(A, B), and so on. More completely, we learn to model language (including constructed languages like mathematics) in the same way we model objects in the world, and we laterally bind those two models by their associations, so that the tables which interconnect them act as a sort of polymerase to build conceptual state from symbolic state or vice versa. (This in turn is how/why our comprehension of language is contextually conditioned.)
Because tables can be populated by example, it is possible to build an intuitive sense of a symbolic rule by applying it through this symbolic translation to imaginary scenarios sufficient to cover the space (fill the table). That is, with some effort on our part, we can come to intuitively understand (by integrating into our model) a state represented, symbolic rule by simply ruminating on it sufficiently.
Yeah, So?
It's probably useful. That is all for now.
NEXT UP: OKR 2 - Examples of Intuitive Reasoning

[<< | Prev | Index | Next | >>]

Row	Person	Company	Role	Email
J	john	Xcorp	sales	sales@Xcorp.com
K	john	Xcorp	techsupport	jd@Xcorp.com
L	john	Zcorp	cook	fud@Zcorp.com
M	mary	Zcorp	assassin	goodtimes@Zcorp.com
N	rob	Xcorp	sales	sales@Xcorp.com

Less(A, B)	Less(B, C)	Less(A, C)
T	T	T
T	F	?
F	T	?
F	F	F

PlusOne(X, Y)	Less(X, Y)
T	T
F	?

Simon Funk / simonfunk@gmail.com