Discussions of technical integration

Discussing the KF "programmer's intro" technical document (attached) - in relation to SAIL

Turadg Aleahmad (lead architect of SAIL) asked the following questions, and Andrew Green replied:

  • 1) The document mentions RDF. Why didn't they just use RDF?
    Historical reasons mainly. KF is about 10 years old, and some of our databases have been around since the beginning, although obviously transformed as we evolved our storage mechanisms. RDF was kicking around in 1996 (as an Apple project called MCF), and it informed our thinking. We haven't adopted RDF wholesale because there hasn't been a compelling reason to do so. Although a lot of what we're doing is RDF-ish, there are enough barnacles and expediencies that the model of storing/searching tuples has proven to be a good level of abstraction. However an RDF-view of KF would make a lot of sense.
  • 2) How hard would it be to adapt their web UI to use an RDF store for their tuple base?
    The web UI code uses the ZTB API that's discussed in http://www.zoolib.org/zoolib_doxygen/html/group__group__TupleBase.html
    some of which ended up in the programmer's intro doc I sent earlier, so if a ZTB-compatible API could be put over an RDF store, then obviously the web UI would just work. Changing APIs would entail a huge amount of work, frankly enough work that it would probably kill the web UI.
  • 3) What would it take to write a Java Swing UI to their tuple base?
    The KF applet uses Swing for its UI. Its data access is mainly based on the ZTSoup API, which talks to the same physical tuplestore as the ZTB, but is geared towards the demands of a live UI. In particular it is not transaction-based, instead being more like an object store with change notifications. Local changes are batched and submitted, and consequent changes in query results returned (plus any changes made by other clients since the last time we checked with the server).
  • 4) How is the server backing the web UI implemented? Can it run in a JVM?
    The essential code for the server is at http://zoolib.cvs.sourceforge.net/zoolib/zoolib/src_other/tuplebase/
    It's written in C++. In the past we've had JNI bindings to it, and those could be revived pretty trivially, at which point it could run in a JVM. The C++ client code is in the same directory (ZTBRep_Client.cpp and ZTSWatcherClient.cpp, for ZTB and ZTSoup access, respectively). The Java client code is in zoolib/java/org/zoolib/tuplebase.
  • 5) What if SAIL had an RDF store service available at author, learner and assessor run time? What could they do with that?
    Making an RDF store look like a tuplestore might be doable. Our references are based on 64 bit IDs allocated on a per-store basis, whereas RDF uses URLs. And canonical RDF breaks things up at a finer granularity than we do. But it could work.
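
As a rough sketch of the ID-mapping idea in (5) (purely illustrative, not zoolib code), an RDF-backed store could hand out per-store 64-bit IDs while remembering which URI each ID stands for:

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical helper, not part of zoolib: maps RDF resource URIs to
    // per-store 64-bit IDs so tuplestore-style references keep working.
    class RDFIDMap {
    public:
        RDFIDMap() : nextID(1) {}

        // Return the ID already assigned to this URI, or allocate the next one.
        // IDs are never reissued within a store.
        uint64_t IDForURI(const std::string& uri) {
            std::map<std::string, uint64_t>::const_iterator i = uriToID.find(uri);
            if (i != uriToID.end())
                return i->second;
            uint64_t id = nextID++;
            uriToID[uri] = id;
            idToURI[id] = uri;
            return id;
        }

        // Reverse lookup, needed when writing changes back out as RDF.
        std::string URIForID(uint64_t id) const {
            std::map<uint64_t, std::string>::const_iterator i = idToURI.find(id);
            return i != idToURI.end() ? i->second : std::string();
        }

    private:
        uint64_t nextID;
        std::map<std::string, uint64_t> uriToID;
        std::map<uint64_t, std::string> idToURI;
    };

A persistent version would also have to record the mapping so IDs survive restarts, and the finer RDF granularity mentioned above would still need flattening into tuple-shaped property maps.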

Jeremy Roschelle of SRI asks about the possible scaling of the KF tuple base to 10,000 users

Jeremy's question

  • How much will it cost to scale the database underlying Knowledge Forum to reliably support 100 schools with 100 users each? Explain the architecture, the established facts, and the assumptions in your estimate.

Andrew G. replies:
I mentioned to Jim in our conversation on Friday that we were looking at supporting 10,000 connected users and thought that the details of how that might be feasible would be of interest. A month's gone by and the ZTSWatcherProxy discussed below has been checked into zoolib cvs. Although I can't actually test it with real users, its behavior so far seems to support the claim that 10,000 connections is doable.

For now I'm going to punt on the cost part of the question, and restrict myself to a discussion of what we're currently doing, and what we'd need to change to support 10,000 users (I'm assuming that this means handling a peak load of 10,000 simultaneous connections.)

For some background and to establish nomenclature, it would be helpful if you were to read the attached document, the first few
sections down to "What is a tuplebase?" (at line 90).

The underlying storage mechanism is called a tuplestore. A tuplestore provides three services:
1. It hands out IDs from a 64 bit address space. Once issued, an ID is guaranteed never to be issued again. IDs can be used to identify tuples in the tuplestore, or for other purposes.
2. Given an ID it returns the tuple stored at that ID, or it can replace the tuple stored at that ID.
3. Given a specification it efficiently returns the IDs of all tuples satisfying that specification.
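
Expressed as an interface sketch (the names and signatures below are approximations for illustration, not the actual zoolib declarations), the three services might look like:

    #include <cstdint>
    #include <set>

    struct Tuple {};      // placeholder: really a map from property names to typed values
    struct TupleSpec {};  // placeholder: a pattern matching tuples by their properties

    class TupleStore {
    public:
        virtual ~TupleStore() {}

        // Service 1: hand out IDs from a 64-bit space; an ID is never reissued.
        virtual uint64_t AllocateID() = 0;

        // Service 2: fetch or replace the tuple stored at an ID.
        virtual Tuple Get(uint64_t id) = 0;
        virtual void Set(uint64_t id, const Tuple& tuple) = 0;

        // Service 3: return the IDs of all tuples satisfying a specification.
        virtual std::set<uint64_t> Search(const TupleSpec& spec) = 0;
    };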

The KF HTML interface exclusively uses the tuplebase API to work with the content of a tuplestore. The tuplebase API puts a transactional interface over a tuplestore, and implements queries that use values retrieved from tuple properties to further interrogate the tuplestore.
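
As a usage-shaped sketch of that transactional layer (hypothetical names throughout; the real ZTB calls are documented at the zoolib link above), the pattern is: read a tuple, use one of its property values to drive a further query, and retry the whole unit if the commit fails.

    #include <cstdint>
    #include <set>
    #include <string>

    // Placeholders standing in for zoolib's tuple and query-spec types.
    struct Tuple { int64_t authorID; };
    struct TupleSpec { std::string prop; int64_t value; };

    // Hypothetical transactional view over a tuplestore (not the real ZTB class).
    class Transaction {
    public:
        Tuple Get(uint64_t /*id*/) { return Tuple(); }
        std::set<uint64_t> Search(const TupleSpec& /*spec*/) { return std::set<uint64_t>(); }
        bool Commit() { return true; }  // false would mean a conflicting write
    };

    class TupleBase {
    public:
        Transaction Begin() { return Transaction(); }
    };

    // Read one tuple, use a value from its properties to interrogate the store
    // further, and retry the whole transaction if the commit is refused.
    std::set<uint64_t> NotesBySameAuthor(TupleBase& tb, uint64_t noteID) {
        for (;;) {
            Transaction t = tb.Begin();
            Tuple note = t.Get(noteID);
            TupleSpec byAuthor = { "authorID", note.authorID };
            std::set<uint64_t> result = t.Search(byAuthor);
            if (t.Commit())
                return result;
            // Commit failed: another client changed something we read; try again.
        }
    }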

The Java applet primarily uses the ZTSoup API. This lets the applet register its interest in individual tuples and in the result lists of queries, and be promptly notified of changes in both. It consolidates the registrations issued by different parts of the applet and ultimately does its work by using the ZTSWatcher API.
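
In shape (again, illustrative names rather than the real ZTSoup signatures), that registration interface looks roughly like this:

    #include <cstdint>
    #include <set>

    struct Tuple {};
    struct TupleSpec {};

    // Callback interface a UI component implements to hear about changes.
    class SoupCallback {
    public:
        virtual ~SoupCallback() {}
        virtual void TupleChanged(uint64_t id, const Tuple& newValue) = 0;
        virtual void QueryResultsChanged(const std::set<uint64_t>& newIDs) = 0;
    };

    class Soup {
    public:
        virtual ~Soup() {}
        // Register or withdraw interest; the soup consolidates registrations
        // from all parts of the applet before talking to the ZTSWatcher layer.
        virtual void RegisterTuple(uint64_t id, SoupCallback* cb) = 0;
        virtual void RegisterQuery(const TupleSpec& spec, SoupCallback* cb) = 0;
        virtual void Unregister(SoupCallback* cb) = 0;
        // Local writes are queued and go out with the next batch; consequent
        // changes (ours and other clients') come back via the callbacks.
        virtual void Set(uint64_t id, const Tuple& tuple) = 0;
    };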

ZTSWatcher provides two services:
1. It hands out IDs from the 64 bit address space.
2a. It updates the list of tuples and queries in which the caller is interested.
2b. It accepts writes against tuples.
2c. It returns the contents of registered tuples and query result lists which have changed.

As implied by the numbering, the second service (imaginatively called 'DoIt') is atomic: registration changes, writes to tuples, and consequent reports of changes all occur in a single hit. This leverages the model whereby we tend to add new tuples rather than rewrite existing ones, and where there are many more readers of any particular tuple than writers.
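
Sketched as a single call (parameter names and types here are approximations, not the actual ZTSWatcher declaration), DoIt carries 2a, 2b and 2c in one round trip:

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <vector>

    struct Tuple {};
    struct TupleSpec {};

    class Watcher {
    public:
        virtual ~Watcher() {}

        // Service 1: allocate a run of never-reissued IDs from the 64-bit space.
        virtual void AllocateIDs(size_t count, uint64_t& baseID) = 0;

        // Service 2, atomic: 2a registration changes, 2b writes, and 2c the
        // consequent changes, all in one hit.
        virtual void DoIt(
            const std::set<uint64_t>& tuplesToRegister,        // 2a
            const std::set<uint64_t>& tuplesToUnregister,      // 2a
            const std::vector<TupleSpec>& queriesToRegister,   // 2a
            const std::map<uint64_t, Tuple>& writes,           // 2b
            std::map<uint64_t, Tuple>& changedTuples,          // 2c (out)
            std::vector<std::set<uint64_t> >& changedResults   // 2c (out), one per query
            ) = 0;
    };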

There are two ZTSWatcher implementations. One is part of a tuplestore implementation, the other does its work by talking over a comms link to a ZTSWatcherServer instance.

The architecture is thus:
ZTSoup->ZTSWatcher_Client
=>ZTSWatcherServer->ZTS_Watchable::Watcher n->1
ZTS_Watchable->ZTS_RAM

Where '->' is an in-process connection, '=>' is a connection over a comms channel, and 'n->1' indicates a many-to-one connection.

So, a client uses a ZTSoup instance, which does its work by making requests of an in-memory ZTSWatcher_Client instance.
ZTSWatcher_Client talks over a comms channel to a ZTSWatcherServer instance, which passes on the requests to a ZTSWatcher that's created and managed by the ZTS_Watchable tuplestore. ZTS_Watchable does the actual work of managing registrations, detecting changes and returning updated information.

KF Performance Benchmarks
--------------------------

Other limits and issues
-----------------------
Limit: Linux defaults to allowing 1024 file descriptors per process.
We currently open two TCP streams per client connection, so we'd
exhaust the file descriptor limit with 512 clients.

Solution: Multiplexing a client's communications over a single TCP
stream increases the number of connections on an unconfigured box to
1024. Changing system settings lets us go beyond 1024.
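
As one concrete illustration of the system-settings point (the administrator still has to grant a high enough hard limit, e.g. via limits.conf), a process can raise its own descriptor soft limit with setrlimit(2):

    #include <sys/resource.h>
    #include <cstdio>

    // Raise this process's file-descriptor soft limit toward 'wanted', capped
    // at the hard limit set by the administrator. Returns true if we got there.
    bool RaiseFDLimit(rlim_t wanted) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return false;
        if (rl.rlim_cur >= wanted)
            return true;
        rl.rlim_cur = (wanted < rl.rlim_max) ? wanted : rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            std::perror("setrlimit");
            return false;
        }
        return rl.rlim_cur >= wanted;
    }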

Limit: A 32-bit process has 3GB of virtual address space available to
it. We currently spawn three threads per connection, so (at roughly
1MB of stack per thread) we'd exhaust our address space with 1024
connections.

Solution: The comms code is very amenable to being serviced by
threads drawn from a pool, rather than dedicating one or more threads
per connection.
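
A minimal sketch of the pool idea (written with modern C++ threading for brevity; the actual zoolib comms code predates these facilities): connections post small work items to a fixed set of threads instead of each owning dedicated ones.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
    public:
        explicit ThreadPool(size_t threadCount) {
            for (size_t i = 0; i < threadCount; ++i)
                workers.emplace_back([this] { this->Run(); });
        }

        ~ThreadPool() {
            {
                std::lock_guard<std::mutex> lock(mutex);
                stopping = true;
            }
            cond.notify_all();
            for (std::thread& t : workers)
                t.join();
        }

        // Called by the comms layer when a connection has work to service.
        void Post(std::function<void()> task) {
            {
                std::lock_guard<std::mutex> lock(mutex);
                tasks.push(std::move(task));
            }
            cond.notify_one();
        }

    private:
        void Run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(mutex);
                    cond.wait(lock, [this] { return stopping || !tasks.empty(); });
                    if (stopping && tasks.empty())
                        return;
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();
            }
        }

        std::vector<std::thread> workers;
        std::queue<std::function<void()>> tasks;
        std::mutex mutex;
        std::condition_variable cond;
        bool stopping = false;
    };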

Limit: There are several attributes of Linux Kernel 2.4 that are
problematic with very large numbers of processes/threads or sockets.

Solution: Linux 2.6 is significantly better at handling high loads.

Issue: We're using a suite of synchronization primitives that are
safer than strictly necessary (they can be disposed whilst being
waited on, and support recursive acquisition), and thus are less
efficient than they could be.

Solution: Use lighter-weight primitives in hot spots.

Scaling to larger user populations
----------------------------------
The approach we're prototyping uses a proxy ZTSWatcher. From a
caller's perspective this behaves like a regular ZTSWatcher, but does
its work by aggregating the registrations, writes and reads from
multiple clients and passes them to a root server in batches. The
overlapping interests of connected clients means that the load (CPU
and bandwidth) seen by the root server is a fraction of what it would
see if the clients connected directly, and more importantly a lot of
work can be handled by a proxy without ever contacting the server.
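
A hedged sketch of that aggregation (not the actual ZTSWatcherProxy code in cvs): the proxy reference-counts downstream interest per tuple, only forwards the first registration upstream, and answers repeat reads from its own cache without touching the root server.

    #include <cstdint>
    #include <map>

    struct Tuple {};

    class ProxyAggregator {
    public:
        // Returns true if this is a new interest that must be forwarded upstream.
        bool RegisterDownstream(uint64_t id) {
            return ++interest[id] == 1;
        }

        // Returns true if the last downstream client just lost interest, so the
        // upstream registration can be dropped too.
        bool UnregisterDownstream(uint64_t id) {
            std::map<uint64_t, int>::iterator i = interest.find(id);
            if (i == interest.end())
                return false;
            if (--i->second > 0)
                return false;
            interest.erase(i);
            cache.erase(id);
            return true;
        }

        // Freshly reported value from the root server, fanned out to clients.
        void NoteChanged(uint64_t id, const Tuple& value) { cache[id] = value; }

        // Serve a read locally when we already hold the current value.
        bool GetCached(uint64_t id, Tuple& outValue) const {
            std::map<uint64_t, Tuple>::const_iterator i = cache.find(id);
            if (i == cache.end())
                return false;
            outValue = i->second;
            return true;
        }

    private:
        std::map<uint64_t, int> interest;   // tuple ID -> number of interested clients
        std::map<uint64_t, Tuple> cache;    // current values for registered tuples
    };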

In the posited scenario one topology would place a proxy server at
each school to which the school's users connect. The proxies would
connect to a root server at an appropriate location. The proxies are
thus servicing 100 connections, with a workload similar to what a
regular server would experience from 100 connections, so a single
processor box would be adequate.

The root server would also be handling 100 connections, although here
the workload for each would be higher than that imposed by a regular
client. However it would not be 100 times higher, because the proxies
are consolidating some requests and fielding others without touching
the server at all. And there would still be overlapping interests
across the proxies as a whole, which would benefit from the root
server's caching.

The architecture becomes:
ZTSoup->ZTSWatcher_Client
=>ZTSWatcherServer->ZTSWatcherProxy::Watcher n->1
ZTSWatcherProxy->ZTSWatcher_Client
=>ZTSWatcherServer->ZTS_Watchable::Watcher n->1
ZTS_Watchable->ZTS_RAM
The same as before, from the perspective of the client and the
server, but a proxy has been interposed.

More work needs to be done to determine the actual workloads that
would be experienced by a root server, but it seems feasible that the
proxy scheme could actually be multi-leveled, where clients connect
to a proxy, which connects to a lower-level proxy, and so forth,
ultimately reaching a root server.

ZTSoup->ZTSWatcher_Client
=>ZTSWatcherServer->ZTSWatcherProxy::Watcher n->1
ZTSWatcherProxy->ZTSWatcher_Client
=>ZTSWatcherServer->ZTSWatcherProxy::Watcher n->1
ZTSWatcherProxy->ZTSWatcher_Client
=>ZTSWatcherServer->ZTS_Watchable::Watcher n->1
ZTS_Watchable->ZTS_RAM

Supporting hundreds of thousands or millions of connections will
require a more sophisticated implementation for the back-end, where
indices and storage are spread across a cluster of machines.
