Remembrance Agent, version 1.22 beta
Jan-Christian Nelson and Bradley Rhodes, MIT Media Lab
June 3, 1997

Contents
--------
0.  Introduction
1.  Building the Remembrance Agent
2.  Creating a database from a collection of text
3.  Using the Remembrance Agent in Emacs
4.  Savant's document template system and other customizations
5.  What's New?

0.  Introduction
----------------
The Remembrance Agent is one of the projects being developed by 
the MIT Media Lab's software agents group.  Given a collection
of the user's accumulated email, usenet news articles, papers, 
saved HTML files and other text notes, it attempts to find those
documents which are most relevant to the user's current context.
That is, it searches this collection of text for the documents
which bear the highest word-for-word similarity to the text the user
is currently editing, in the hope that they will also bear high
conceptual similarity and thus be useful to the user's current work.  

The Remembrance Agent works in two stages.  First, the user's collection
of text documents is indexed into a database saved in a vector format.
Briefly, the concept behind the vector format is that any given document
may be represented by a vector, each entry of which corresponds to a
single word and is equal to the number of times that word appears
in the document.  The advantages gained are that retrieval from disk
is relatively speedy, and similarity between two documents is simple
to compute:  you take the dot product of their (normalized) vectors.
After the database is created, the second stage of the Remembrance Agent
runs from emacs, where it periodically takes a sample of text from
the working buffer, converts it to this vector format, and finds those
documents from the collection whose dot products with it are highest.
It summarizes the top documents in a small emacs window and allows you
to retrieve the entire text of any one with a keystroke.
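
As a rough illustration of the vector idea described above, here is a
minimal Python sketch.  It is not Savant's actual code (Savant is a
compiled program built with make): the function names and the tokenizing
rule are assumptions made for this example, and it skips refinements
such as discarding common words.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Map each lowercased word to the number of times it occurs."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(a, b):
    """Dot product of two count vectors after scaling each to unit length."""
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # A Counter returns 0 for words it has never seen.
    dot = sum(n * b[word] for word, n in a.items())
    return dot / (norm_a * norm_b)


def rank(sample, documents):
    """Return (score, name) pairs for a {name: text} dict, best match first."""
    query = word_vector(sample)
    scored = [(similarity(query, word_vector(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)
```

Calling rank() with a sample of the working buffer and a dictionary of
document texts gives the documents ordered by score, which is the kind of
list the emacs window summarizes.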

The Remembrance Agent was the idea of Thad Starner and Brad Rhodes,
graduate students at the Media Lab, and is the product of long hours
of coding by MIT undergrad Jan Nelson.  They can be reached with 
questions and comments about the Remembrance Agent via email at 
ra-bugs@media.mit.edu, and new versions of the software 
can be found at http://www.media.mit.edu/~rhodes/RA/

Good luck, and thanks for using the Remembrance Agent.

1.  Building the Remembrance Agent
----------------------------------
These are the files and directories included in the RA distribution:

README:                  this file
src:			 Source for Savant, the RA back-end
remem-display-mode.el:   emacs interface to savant
remem-display-mode.elc:  byte-compiled form of remem-display-mode.el
savantrc:                configuration file for savant (move to ~/.savantrc)
savant.internals:	 file describing how savant works.  This is
			 somewhat out-of-date, but it's a good start.
vector.doc:		 a description of how document vectors are stored

The RA back-end has two executables, "ra-index" and "ra-retrieve", which
together make up the system called Savant.  To build savant, cd to the
src directory under the main RA directory and type 'make'.  After
the compilation is finished, the executables will be in RA/src.  Move
these files to wherever you normally keep executables, and delete the
source files if you like.  You should also copy the two
remem-display-mode files to wherever you like to keep emacs-lisp
files.  Finally, rename savantrc to .savantrc and move it to your home
directory (it is called savantrc in the distribution so that it won't
accidentally go unnoticed).


2.  Creating a database from a collection of text
-------------------------------------------------
Savant has two executables, one which indexes documents into
databases, and one which performs interactive retrievals from these
databases.  To use the first, ra-index, you must have a set of source
text files, and a directory savant can put database files into.
Usage:
    ra-index [-v] [-c <config-file>] <base-dir> <source1> [<source2>] ... 
             [-e <excludee1> [<excludee2>] ...]

The <source> arguments may be files or directories.  If a directory is in
the list, savant will use all its contents, recursing into all
subdirectories.  Non-text files and backup files (those whose names end in ~
or begin with #) are ignored.  Any files or directories specified after
the optional -e flag will be excluded.  Savant will use any files it finds
to create a database in the specified base directory, which must already
exist.  The optional -v argument (verbose) will direct savant to keep you
updated on its progress.  So for example,
	ra-index -v ~/Database/src-base ~/src -e ~/src/*.h
will build a database in my Database/src-base directory, made up of
files from my src directory, excluding files ending in '.h', and it will
do this verbosely.
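
The selection rules above can be sketched in Python as follows.  This is
only an illustration under stated assumptions, not how ra-index is
implemented: is_backup and collect_sources are hypothetical names, and
the sketch makes no attempt to detect non-text files, which savant also
skips.

```python
import os


def is_backup(name):
    """True for backup-style names: those ending in ~ or starting with #."""
    return name.endswith("~") or name.startswith("#")


def collect_sources(sources, excluded=()):
    """Expand files and directories into a sorted list of files to index."""
    excluded = {os.path.abspath(p) for p in excluded}
    found = []
    for source in sources:
        source = os.path.abspath(source)
        if source in excluded:
            continue
        if os.path.isdir(source):
            for root, dirs, files in os.walk(source):
                # Prune excluded subdirectories so the walk never enters them.
                dirs[:] = [d for d in dirs
                           if os.path.join(root, d) not in excluded]
                for name in files:
                    path = os.path.join(root, name)
                    if path not in excluded and not is_backup(name):
                        found.append(path)
        elif not is_backup(os.path.basename(source)):
            found.append(source)
    return sorted(found)
```

Note that in the ra-index example above the shell expands ~/src/*.h
before the program runs, so the excluded arguments arrive as a plain list
of file names, which is what this sketch expects as well.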

There is also an optional [-c <config-file>] argument, which tells 
savant to use <config-file> instead of the .savantrc in your homedir.
Section 4 of this file details how .savantrc can be modified or how
a new one can be written.

***IMPORTANT***: Savant can build databases in any directory you like, but
the emacs interface for the Remembrance Agent expects a particular
structure.  For each database you want to make, you should create a
directory, and all these directories should live in the same parent
directory.  For example, for my own use I have a directory
~cabbage/Database/, and within that directories ~cabbage/Database/mail/,
~cabbage/Database/papers/, etc. which actually contain the database files.

To see how savant interacts with emacs while the remembrance agent is
running, try running ra-retrieve with the command 'ra-retrieve -v <base-dir>'
after creating a database using ra-index.

3.  Using the Remembrance Agent in Emacs
----------------------------------------
You can load the Remembrance Agent automatically every time you run emacs
by placing the line (load "remem-display-mode") in your .emacs file in your 
homedir.  This assumes that one of remem-display-mode.el or 
remem-display-mode.elc exists in your emacs load-path.  

Before the Remembrance Agent can be used, several variables must be 
configured.  They are set by placing these lines in your .emacs file:
  (setq remem-prog-dir <prog-dir-string>)
  (setq remem-database-dir <database-dir-string>)
  (setq remem-display-scope-databases <3-element list of strings>)

<prog-dir-string> should be the full name of the directory where you put
the ra-retrieve executable, enclosed in double quotes.  In my own use, for
example, I set this to "/.../users/cabbage/bin".

<database-dir-string> should be the full name of the directory
which holds your database directories, enclosed in double quotes.  I use,
for example "/.../users/cabbage/Database".  Note that this is the name of a
directory containing directories, not the directory containing the database
files themselves.

remem-display-scope-databases: the emacs version of the
remembrance agent has three "scopes," separate processes performing
retrievals simultaneously.  remem-display-scope-databases should
be set to a three-element list of names of sub-directories of the
remem-database-dir from which these scopes should retrieve.  They
may be repeated.  In my own .emacs, for example, I have this set to
'("mail" "mail" "papers") (note the single quote to indicate a list),
corresponding to /.../users/cabbage/Database/mail and
/.../users/cabbage/Database/papers.  NOTE: You can specify 'nil' instead of
"dirname" for any element of this list, and that scope will not start its
process.  This will save some memory and processor time.

Okay!  After all these customizations are made, you can start the
Remembrance Agent by typing C-c r r.  It will create its window and
after a moment or two begin to display suggestions like:
   1: 0.26 | Golan Levin       | 07 Feb 96 | this mess  
   2: 0.08 | cabbage           | 04 May 95 | 502journal.tex
   3: 0.07 | cabbage           | 06 Dec 94 | timing.tex
This can be summarized as 
 ID#: rating | author or file owner | date | subject or filename

The rating is a measure from 0 to 1 of how relevant the document is to a 
sample of your current buffer.  To see a suggested document, type
C-c <ID# of document>.

Further customizations
----------------------
There are three other variables the Remembrance Agent allows you to
customize, and they can be set in the same way as the three above.

remem-display-scope-number-lines is a three-element list, like
remem-display-scope-databases.  Instead of controlling which database
a scope searches from, an element controls how many lines a scope gets 
in the suggestion summary shown above.  The default is '(1 1 1), one line 
for each scope, but you have a total of nine to parcel out as you see fit.
Specifying one of these to be 0 has the same effect as specifying an
element of remem-display-scope-databases to be nil, that is, the scope 
is disabled.

remem-display-scope-update-times is a three-element list controlling
how often, in seconds, each scope updates its suggestions.  The
default is '(10 10 10), ten seconds for each scope.

remem-display-scope-range is another three-element list controlling
the size in words of the samples each scope takes when it goes to
find new suggestions.  The default is '(100 100 100), one hundred
words for each scope.


4.  Savant's document template system and other customizations
-------------------------------------------------------------- 
Two types of things appear in .savantrc or another customization file.
There are variables and templates, and there is no distinct section
for either (i.e., variable assignments may be mixed in with templates).
The variables are exceedingly simple: to set one, include in .savantrc
<variable_name> followed by <value>.  At present, there are only
four variables in use:
  source_field_width - specifies the number of characters allotted to the
    source field of savant's summary lines, the leftmost text-field,
    specifying who authored the document being suggested.  This is the
    only width setting because the date field has a fixed width, and the
    subject field takes up the remaining available space.
  ellipses - may be true or false.  Specifies whether the name in the
    source field should be abbreviated with ellipses ("...") or simply
    truncated.
  document_windowing - may be true or false.  If this is true, each document
    is broken up into overlapping "windows", each lines_per_window lines long
    (see lines_per_window).  The motivation for this is that since very large
    documents may only have a small section that is relevant to the 
    current context, we want to have only that section suggested to us.
  lines_per_window - length in lines of each "window" (section) to break
    documents into.
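
To make document_windowing more concrete, here is a small Python sketch
of splitting a document into overlapping windows.  The lines_per_window
name comes from the variable above; the function name and the amount of
overlap (half a window by default) are assumptions of this sketch, since
this file does not say how far Savant's windows overlap.

```python
def windows(lines, lines_per_window, overlap=None):
    """Split a list of lines into overlapping windows.

    overlap is the number of lines shared by consecutive windows.  It
    defaults to half a window, which is an assumption of this sketch.
    """
    if lines_per_window < 1:
        raise ValueError("lines_per_window must be at least 1")
    if overlap is None:
        overlap = lines_per_window // 2
    if not 0 <= overlap < lines_per_window:
        raise ValueError("overlap must be in [0, lines_per_window)")
    if not lines:
        return []
    step = lines_per_window - overlap
    result = []
    start = 0
    while True:
        result.append(lines[start:start + lines_per_window])
        # Stop once a window has reached the end of the document.
        if start + lines_per_window >= len(lines):
            break
        start += step
    return result
```

Each window can then be indexed and suggested as if it were a document
of its own, so a match points at the relevant section rather than at the
whole file.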

If you can think of more variables you'd like to see, please let us know
(ra-bugs@media.mit.edu).

Savant uses a system of templates to decide where to mark off the separate
documents in a file and what to use in the summary line emacs displays
to you.  The current version of the package provides templates that
will correctly dissect RMAIL, plain email, usenet articles, LaTeX
documents, and HTML documents.  Those files not recognized are treated 
as one large document.  You need not read this section unless you have
another file format you wish savant to recognize.  If you do create one,
let us know! (ra-bugs@media.mit.edu).

A template has this format:
Template <template-name>
{
  Recognize
    <pattern>
  Format
    <pattern>
}
Patterns consist of strings surrounded by double quotes (which may contain
C escape sequences like \n and control sequences like ^L), the variables
SOURCE, DATE, SUBJECT and BODY, and the commands startline, ignore, optional,
anyorder, anyof and icase.  Any pattern may also contain subpatterns
delimited with curly braces.  For each file, savant uses the Recognize
patterns to decide what kind of file it is dealing with: whichever
template gets the earliest match in the file wins.  The corresponding
Format pattern is then used to parse off documents; when savant comes
to one of the variable names, text from the current location in the
file is used to fill that variable.  Here are the definitions of
the keywords - don't worry, an example is coming!

startline: this indicates that the next string or subpattern should
	appear at the start of a line of text (like ^ in regular expressions)
ignore: ignore text until the next specified string or subpattern
optional: the next string or subpattern is optional
anyorder: the remaining strings and subpatterns in the pattern may
	appear in any order
anyof: stop when any of the remaining strings or subpatterns in this
	pattern is found
icase: the next string should be matched case-insensitively
Variables SOURCE, DATE, SUBJECT and BODY:  When savant finds one of
	these, it starts filling the variable with text from the
	file.  It stops at the next pattern item it finds.  During 
	retrieval, the first three variables are used in the 
	emacs summary line like so:
	#: 0.xx | SOURCE | DATE | SUBJECT
	and BODY is the text returned when you hit C-c #.

So let's look at RMAIL's format pattern from .savantrc.
  Format 
    {"^_^L\n", ignore, startline, "*** EOOH ***\n",
     {anyorder, {startline, "From: ", SOURCE, "\n"},
               {startline, "Date: ", DATE, "\n"},
               optional {startline, "Subject: ", SUBJECT, "\n"}},
     "\n\n", BODY}

What savant starts looking for is "^_^L\n"; when this is found it ignores
whatever text comes next until "*** EOOH ***\n" is found at the head of
a line.  Next is a subpattern, which is declared to be orderless by the
anyorder command.  It contains three subpatterns, the third of which
is optional, each of which looks for a string at the head of a line,
and upon finding it, fills a variable with the text following the
string, up until the next newline character.  This will result in 
SOURCE getting the From: line, DATE getting the Date: line, and SUBJECT
getting the Subject: line.  Some email has no subject, so this last one
is optional.  If an optional variable is not filled, a default is used
instead: the file owner for SOURCE, the file creation time for DATE, or
the filename for SUBJECT.  Finally, after everything in the anyorder
subpattern has been dealt with, it goes on to look for a blank line (\n\n),
and the remainder of the document (up until the next document) is the BODY.
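
For readers who find code easier to follow than pattern syntax, here is
a Python sketch that approximates what the RMAIL Format pattern above
does.  It is an assumption-laden illustration, not Savant's template
engine: parse_rmail and its owner, mtime and filename parameters are
hypothetical, and regular expressions stand in for savant's own matcher.

```python
import re

# "^_" is Control-_ (0x1F) and "^L" is Control-L (0x0C).
MESSAGE_START = "\x1f\x0c\n"
END_OF_HEADER_MARK = "*** EOOH ***\n"


def parse_rmail(text, owner="unknown", mtime="unknown", filename="unknown"):
    """Split RMAIL text into dicts with SOURCE, DATE, SUBJECT and BODY."""
    documents = []
    # Text before the first delimiter is file preamble, not a message.
    for chunk in text.split(MESSAGE_START)[1:]:
        # ignore, startline, "*** EOOH ***\n"
        mark = re.search("^" + re.escape(END_OF_HEADER_MARK), chunk,
                         re.MULTILINE)
        if mark is None:
            continue
        # "\n\n": a blank line separates the header lines from BODY.
        headers, _, body = chunk[mark.end():].partition("\n\n")
        # Defaults used when a variable is not filled.
        doc = {"SOURCE": owner, "DATE": mtime, "SUBJECT": filename}
        # anyorder: each header is searched for independently of the others.
        for label, key in (("From: ", "SOURCE"), ("Date: ", "DATE"),
                           ("Subject: ", "SUBJECT")):
            found = re.search("^" + re.escape(label) + "(.*)$", headers,
                              re.MULTILINE)
            if found:
                doc[key] = found.group(1)
        # The last message may end with a lone ^_ terminator; drop it.
        doc["BODY"] = body.rstrip("\x1f")
        documents.append(doc)
    return documents
```

Because each header is looked up separately, the From:, Date: and
Subject: lines may appear in any order, which mirrors the anyorder
subpattern; a missing Subject: line simply leaves the default in place.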

By the way, all commands and variable names are case insensitive.
If you experiment with creating new templates, be forewarned that 
there is very little if any syntax checking built in right now,
so try to observe the style used in the five templates provided.
Feel free to email us for tips and pointers, at
remembrance-maintainer@media.mit.edu

5.  What's New?
---------------
Version 1.0b   -  June 1996.  First stable public release.
Version 1.01b  -  September 1996.  Fixed a documentation bug.
Version 1.1b   -  October 1996.  Fixed a bug causing common words not to be 
		  discarded.  Databases will now be smaller by a significant
		  factor, and faster to search by an even greater factor.  
                  Also fixed a bug where (stupidly) I was reading index -1 of
                  an array, often causing core dumps.  Lastly, we added a 
                  hook to emacs rmail-mode which will help the mail interface
                  ignore mail headers.  This should be easily extendable to any
	          emacs mail package, so send me mail about your favorite one.
Version 1.11b  -  November 1996.  Here I fixed the -1 bug mentioned above, this 
		  time for real.  Also made savant strip document titles of
		  any control codes, and implemented a new system of indexing
		  a document's subject along with its body.
Version 1.2b   -  March 1997.  Separated the "savant" program into "index"
                  and "retrieve" to make them both more lightweight.  Also
                  cleaned up a lot of the internals, making it easier to
                  integrate savant into other programs.
Version 1.22b  -  June 1997.  Fixed several memory leaks, and added some
                  hooks for future development.

Jan-Christian Nelson
