was designed to fit in with the existing Web architecture. There are
near-term and long-term advantages to using the standard protocols and browsers.
The annotator is immediately accessible by anyone having Web access and
a standard Web browser. Over the long term, as protocols improve and the
browser wars continue, our backlinks extension benefits from the development
efforts of others.
The annotator hooks into the Web through the proxy interface. The configuration
shown below is common and well supported. Proxies exist to funnel all requests
through a firewall; incoming responses are restricted to a single host:
Proxies have the useful property of chaining: a network can include a single
proxy as well as a series of proxies. This property led to the following
The proxy interface provides the hooks needed to add backlinks to the Web
(see Proxies and Firewalls). The annotator,
as a member of the proxy chain, can listen to outgoing requests and intercept
incoming responses. This simple program (roughly 20 pages of C code) does
little more than what is required to annotate documents. Rather than implement
the full proxy interface, it runs with a proxy server. For example, on my
laptop, I run the annotator with the Apache Web server configured as a proxy
server. Together they make an "annotating proxy".
When the annotator receives a browser request, it simply passes this along
to the proxy server. The document is fetched and returned to the annotator.
At this point, before the annotator returns the document to the browser,
annotations are added.
The annotator, having received a document, searches available annotation
sets for links to this document. Annotations are specified with:
the URL of the document being annotated
a text pattern
the text written by the author of the annotation
A backlink is added for each annotation found. If a pattern is given, the
document is searched for matching text (a single word, a phrase, or a paragraph)
and, if it is found, the backlink is attached there. If no pattern is given or if
matching text is not found, the backlink is inserted at the beginning of
the document.
The annotator generates the HTML for the backlinks using standard HTML tags,
interleaves the new HTML with the original, and returns the document to
the browser for viewing. (The original document is not modified; the annotations
appear only in the presentation.)
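The placement rule described above can be sketched in a few lines of Python. This is only an illustration, not the annotator's actual C code: the Annotation class, its field names, the source_url field, and the bracketed link format are all assumptions made for the example, and the pattern match is a naive string search that does not try to avoid matching inside HTML tags.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Annotation:
    target_url: str         # URL of the document being annotated
    pattern: Optional[str]  # text to attach the backlink to, or None
    text: str               # text written by the author of the annotation
    source_url: str         # where the backlink points (assumed field)


def backlink_html(a: Annotation) -> str:
    """Render one backlink using ordinary HTML tags (format is an assumption)."""
    return f'<a href="{escape(a.source_url, quote=True)}">[{escape(a.text)}]</a>'


def annotate(document: str, url: str, annotations: list[Annotation]) -> str:
    """Return a copy of `document` with backlinks interleaved.

    The original string is never modified; only the returned copy differs.
    """
    result = document
    unplaced = []  # backlinks that go at the beginning of the document
    for a in annotations:
        if a.target_url != url:
            continue  # annotation belongs to some other document
        pos = result.find(a.pattern) if a.pattern else -1
        if pos >= 0:
            # Attach the backlink right after the first match of the pattern.
            end = pos + len(a.pattern)
            result = result[:end] + backlink_html(a) + result[end:]
        else:
            # No pattern given, or no matching text found.
            unplaced.append(backlink_html(a))
    return "".join(unplaced) + result
```

A real implementation would parse the HTML so that backlinks land inside the document body and patterns are matched only against visible text.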
Proxies and Firewalls
The prevalence of proxy-based firewalls led to browser support for proxy
redirection. By simply setting a browser option, the user enables an efficient,
transparent mechanism for the redirection of requests.
During the alpha phase, use of the browser option is limited. This option
cannot be used if a firewall is between the browser and the annotating proxy.
Fortunately, there is an alternative: pseudo proxies. This is a hack, but
it is used in implementing a number of useful extensions; in particular,
anonymizers use it to shield the identity of users browsing the Web.
Instead of transparent redirection, the pseudo proxy's URL is prepended
to the URL being requested. This pseudo proxy sees all requests and intercepts
all responses. To make the initial "connection", the user manually
edits the requested URL, but only for the first request. The pseudo proxy
fetches the document, but before returning it, it scans the document for
links and rewrites the URLs to point at itself. If the user follows a link
from this document, the request is redirected to the pseudo proxy.
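As a rough illustration of the pseudo-proxy technique, the Python sketch below prepends a proxy URL to every link in a fetched page. The proxy address annotator.example.org and the function names are hypothetical, and the regular expression only handles double-quoted href attributes; a real implementation would use an HTML parser and also rewrite forms, images, and other references.

```python
import re
from urllib.parse import urljoin

# Hypothetical address of the pseudo proxy; not taken from the original text.
PSEUDO_PROXY = "http://annotator.example.org/"

# Matches double-quoted href attributes only (a deliberate simplification).
HREF = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def to_proxy_url(url: str) -> str:
    """Prepend the pseudo proxy's URL to the URL being requested."""
    return PSEUDO_PROXY + url


def from_proxy_url(proxy_url: str) -> str:
    """Recover the real URL from a request that arrived at the pseudo proxy."""
    if not proxy_url.startswith(PSEUDO_PROXY):
        raise ValueError("not a pseudo-proxy URL")
    return proxy_url[len(PSEUDO_PROXY):]


def rewrite_links(html: str, base_url: str) -> str:
    """Rewrite every link in `html` so that following it goes via the proxy."""
    def repl(match: re.Match) -> str:
        # Resolve relative links against the URL of the fetched document.
        absolute = urljoin(base_url, match.group(2))
        if not absolute.startswith(("http://", "https://")):
            return match.group(0)  # leave mailto: and similar links alone
        return match.group(1) + to_proxy_url(absolute) + match.group(3)

    return HREF.sub(repl, html)
```

Because every rewritten link already carries the proxy prefix, the user only has to edit the first URL by hand, as described above.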
If the browser option limitation during alpha is unacceptable, the annotating
proxy can easily be modified to implement the pseudo-proxy technique.
A post-alpha option is to port the Linux-based annotating proxy to desktop
machines. This "Personal Annotator" could run on the Windows,
Macintosh, or Java platforms.
The basic summary for how to make annotation sets scale is 'spread it out!'
The annotation system above does not have good scaling properties because
it is too centralized. However, an annotation system in the future would
be much more decentralized and do all of the annotation either directly
in the browser or in a process running on or near the user's machine. Let's
talk about how this future annotation system would scale.
For each annotation set, sort all of the URLs for annotated documents.
Then reduce this list to just the names of the annotated hosts. When
the annotator starts up, it fetches this host name list and stores it locally.
For everything except the most gargantuan/humungous/enormous annotation
sets, this is a fairly modest amount of data to fetch. The host name list
can be cached in the file system so that it does not have to be
refetched each time the user reboots the machine. This cost is paid only
at annotator start-up time.
The annotator merges all of the sorted host name lists for all of
its annotation sets into a single sorted list.
When the annotator is presented with a URL to fetch, it performs a
binary search of the merged host name list to figure out if there are any
annotation sets that may pertain.
If there is a match after the binary search, the annotator now knows
which annotation sets may have annotations for the requested document.
The annotator goes to each annotation set and fetches the list of
the documents for the host that have annotations. Except for truly large
sites that serve up millions of documents, this list will be downloaded fairly
quickly.
Now the annotator knows whether or not the document that has been
requested has any annotations. If so, it goes off and fetches the specific
annotations, merges them in, and returns the modified document to the Web browser.
Please note that in this process there is an initial pause when the
first document is fetched from a given host (to download the annotated
document list). As the user bounces around the Web site, responses are quick
because the annotated document list has already been downloaded.
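The lookup steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions: AnnotationSet is a hypothetical in-memory stand-in for a remote annotation set server, HostIndex and all method names are invented for the example, and no real network fetching or on-disk caching is shown.

```python
import bisect
from urllib.parse import urlsplit


class AnnotationSet:
    """Hypothetical in-memory stand-in for a remote annotation set."""

    def __init__(self, name, annotations):
        # annotations: dict mapping document URL -> list of annotation texts
        self.name = name
        self._annotations = annotations

    def host_list(self):
        """Sorted names of the annotated hosts (fetched once at start-up)."""
        return sorted({urlsplit(u).hostname for u in self._annotations})

    def documents_for_host(self, host):
        """URLs of the annotated documents on one host."""
        return {u for u in self._annotations if urlsplit(u).hostname == host}

    def annotations_for(self, url):
        return self._annotations.get(url, [])


class HostIndex:
    def __init__(self, sets):
        # Merge the per-set host lists into one sorted list of host names,
        # with a parallel list recording which sets mention each host.
        merged = {}
        for s in sets:
            for host in s.host_list():
                merged.setdefault(host, []).append(s)
        self._hosts = sorted(merged)
        self._sets = [merged[h] for h in self._hosts]
        self._doc_cache = {}  # (set name, host) -> annotated document URLs

    def sets_for_host(self, host):
        """Binary search of the merged host name list."""
        i = bisect.bisect_left(self._hosts, host)
        if i < len(self._hosts) and self._hosts[i] == host:
            return self._sets[i]
        return []

    def lookup(self, url):
        """Return all annotations for `url`, fetching document lists lazily."""
        host = urlsplit(url).hostname
        found = []
        for s in self.sets_for_host(host):
            key = (s.name, host)
            if key not in self._doc_cache:
                # The initial pause: first visit to this host for this set.
                self._doc_cache[key] = s.documents_for_host(host)
            if url in self._doc_cache[key]:
                found.extend(s.annotations_for(url))
        return found
```

In Python a dictionary lookup would serve as well as the binary search; the sorted list is kept here only to mirror the steps as described.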
Further scaling issues:
Really popular annotation sets will get a lot of hits (just like popular
web sites). How do we deal with this? There are two answers -- geographic
distribution and load balancing:
Geographic distribution is simple--just put up a mirrored annotation
set server at multiple geographic sites: for example, one in Europe, a couple
in the US, one in Japan, and one in Australia. When an annotator first visits
an annotation set server, it asks 'where are your geographic mirroring sites?'
The server returns the latitude and longitude for each site. The annotator can
compare its latitude and longitude with the various geographic sites and
find the geographically closest one. The network routing people do not like
this answer, since sometimes geographic proximity does not mean network
proximity. Tough! Right now this is how people do geographic distribution;
they ask users to click on the web page that is closest to them.
It may still be the case that a given server in a given geographic
location is getting pounded into oblivion. The solution is to design annotation
sets so that they can load balance: mirror the annotation
set across N servers. The annotator goes to the geographic site and asks
'how many mirrored sites do you have and what are their names?' Each
annotator then takes its Internet address, computes a hash, takes the remainder
of dividing that hash by N, and talks to the annotation set server with that index.
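Both selection rules are easy to sketch in Python. The example below is an illustration only: the function names and the site and mirror names are made up, the great-circle (haversine) formula is one reasonable way to compare latitude and longitude, and SHA-256 is an arbitrary choice of hash, since the text does not name one.

```python
import hashlib
import ipaddress
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_site(my_lat, my_lon, sites):
    """Pick the geographically closest site.

    sites: list of (name, latitude, longitude) tuples, as a server might
    return in answer to 'where are your geographic mirroring sites?'
    """
    return min(sites, key=lambda s: great_circle_km(my_lat, my_lon, s[1], s[2]))[0]


def pick_mirror(client_ip, mirrors):
    """Hash the client's Internet address and take the remainder modulo N."""
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(mirrors)
    return mirrors[index]
```

For instance, an annotator near Paris (48.85, 2.35) choosing among sites in Frankfurt, Washington, Tokyo, and Sydney would select Frankfurt, and a given client address always maps to the same mirror as long as N does not change.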
The final component to the scaling issue is 'what about gigantic annotation
sets?' An example of a gigantic annotation set is one that attempts to keep
track of all back links. This is an annotation set that would basically
span the entire web. First, you need lots and lots and lots of hardware.
This is essentially what Digital's Alta Vista is trying to do. Second, the
strategy of downloading the host name list is basically a waste of time;
the solution is to not do it. Instead, you go to the annotation set server
each time you visit a host and fetch the annotated documents list. Likewise,
for sites that have huge document sets, downloading the list of documents
that are annotated can be a waste of time. Again, the solution is to not
download it, but instead to query the annotation set server each time you fetch
a document. There is no magic here.
Terry Stanley is the lead programmer on the Annotator project and can
be reached at firstname.lastname@example.org.