Recoll has an Application Programming Interface, usable both for indexing and searching, currently accessible from the Python language.
Another less radical way to extend the application is to write input handlers for new types of documents.
The processing of metadata attributes for documents (fields) is highly configurable.
Input handlers used to be called filters, which is still reflected in the name of the directory which holds them and in many configuration variables. They were named this way because one of their primary functions is to filter out the formatting directives and keep the text content. However, these modules may have other behaviours, and the term input handler is now progressively substituted in the documentation. filter is still used in many places though.
Recoll input handlers cooperate to translate from the multitude of input document formats, simple ones such as OpenDocument or Acrobat, or compound ones such as Zip or Email, into the final Recoll indexing input format, which is plain text. Most input handlers are executable programs or scripts. A few handlers are coded in C++ and live inside recollindex. This latter kind will not be described here.
There are currently (1.18 and since 1.13) two kinds of external executable input handlers:
Simple exec handlers
            run once and exit. They can be bare programs like
            antiword, or scripts using other
            programs. They are very simple to write, because they just
            need to print the converted document to the standard
            output. Their output can be plain text or HTML. HTML is
            usually preferred because it can store metadata fields and
            it allows preserving some of the formatting for the GUI
            preview.
Multiple execm handlers can process multiple files (sparing the process startup time, which can be very significant), or multiple documents per file (e.g. for zip or chm files). They communicate with the indexer through a simple protocol, but are nevertheless a bit more complicated than the older kind. Most of the new handlers are written in Python, using a common module to handle the protocol. The exception is rclimg, which is written in Perl. The subdocuments output by these handlers can be directly indexable (text or HTML), or they can be other simple or compound documents that will need to be processed by another handler.
In both cases, handlers deal with regular file system files, and can process either a single document, or a linear list of documents in each file. Recoll is responsible for performing up-to-date checks, dealing with more complex embedding, and handling other upper level issues.
A simple handler returning a document in text/plain format can transfer no metadata to the indexer. Generic metadata, like document size or modification date, will be gathered and stored by the indexer.
Handlers that produce text/html format can return an arbitrary amount of metadata inside HTML meta tags. These will be processed according to the directives found in the fields configuration file.
The handlers that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath element, will be sent back by Recoll to extract the document at query time, for previewing, or for creating a temporary file to be opened by a viewer.
The following section describes the simple
          handlers, and the next one gives a few explanations about
          the execm ones. You could conceivably
          write a simple handler with only the elements in the
          manual. This will not be the case for the other ones, for
          which you will have to look at the code.
Recoll simple handlers are usually shell-scripts, but this is in no way necessary. Extracting the text from the native format is the difficult part. Outputting the format expected by Recoll is trivial. Happily enough, most document formats have translators or text extractors which can be called from the handler. In some cases the output of the translating program is completely appropriate, and no intermediate shell-script is needed.
Input handlers are called with a single argument which is the source file name. They should output the result to stdout.
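As an illustration of this calling convention, here is a minimal sketch of a simple handler written in Python rather than as a shell script. The input format is hypothetical (it assumes that formatting directives are lines starting with a dot), and the names extract_text and rclmyformat are invented for the example. It only shows the shape of a handler: one file name argument, converted text on stdout.

```python
#!/usr/bin/env python3
"""Sketch of a simple exec handler for a hypothetical text format."""
import sys


def extract_text(raw):
    """Drop formatting directives and keep the text content.

    Assumption for this sketch: a directive is any line starting with '.'.
    """
    kept = [line for line in raw.splitlines() if not line.startswith(".")]
    return "\n".join(kept) + "\n"


def main(argv):
    # The handler receives a single argument: the source file name.
    if len(argv) != 2:
        sys.stderr.write("usage: rclmyformat <file>\n")
        return 1
    with open(argv[1], encoding="utf-8", errors="replace") as source:
        # The converted document goes to the standard output.
        sys.stdout.write(extract_text(source.read()))
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Such a script outputs plain text, so it cannot return metadata fields to the indexer.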
When writing a handler, you should decide if it will output plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (this will be defined in a configuration file). Additionally, some formatting may be easier to preserve when previewing HTML. Actually, the deciding factor is metadata: Recoll has a way to extract metadata from the HTML header and use it for field searches.
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the handler if the operation is for indexing or previewing. Some handlers use this to output a slightly different format, for example stripping uninteresting repeated keywords (e.g. Subject: for email) when indexing. This is not essential.
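A handler could test the variable along the following lines. This is only a sketch: the helper names and the header dictionary are invented for the example, and treating an unset variable as "no" is an assumption.

```python
import os


def is_preview(environ=os.environ):
    """True when Recoll asks for preview output rather than indexing output."""
    # Assumption: an unset variable is treated like "no".
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def format_headers(headers, for_preview):
    """Render email-like headers, dropping the repeated keywords when indexing."""
    if for_preview:
        # Keep "Subject: ..." style labels for the human reader.
        return "\n".join("%s: %s" % (name, value) for name, value in headers.items())
    # When indexing, only the values are interesting search material.
    return "\n".join(headers.values())
```

For example, format_headers({"Subject": "Hi"}, is_preview()) yields "Hi" during indexing and "Subject: Hi" for a preview.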
You should look at one of the simple handlers, for example rclps, as a starting point.
Don't forget to make your handler executable before testing!
If you can program and want to write
          an execm handler, it should not be too
          difficult to make sense of one of the existing modules. For
          example, look at rclzip which uses Zip
          file paths as identifiers (ipath),
          and rclics, which uses an integer
          index. Also have a look at the comments inside
          the internfile/mh_execm.h file and
          possibly at the corresponding module.
execm handlers sometimes need to make
          a choice for the nature of the ipath
          elements that they use in communication with the
          indexer. Here are a few guidelines:
          
Use ASCII or UTF-8 (if the identifier is an integer print it, for example, like printf %d would do).
If at all possible, the data should make some kind of sense when printed to a log file to help with debugging.
Recoll uses a colon (:) as a
                separator to store a complex path internally (for
                deeper embedding). Colons inside
                the ipath elements output by a
                handler will be escaped, but would be a bad choice as a
                handler-specific separator (mostly, again, for
                debugging issues).
          In any case, the main goal is that it should
          be easy for the handler to extract the target document, given
          the file name and the ipath
          element.
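The guidelines above can be illustrated with Python's standard zipfile module. This sketch is not the rclzip code and does not implement the execm protocol. It only shows two possible choices of ipath element (a member path, or a printed integer), and how the first one makes extraction straightforward. All function names are invented for the example.

```python
import zipfile


def list_ipaths(archive):
    """Return one ipath element per member file, using the Zip member path."""
    with zipfile.ZipFile(archive) as zf:
        # Directory entries end with "/" and hold no document data.
        return [name for name in zf.namelist() if not name.endswith("/")]


def extract_document(archive, ipath):
    """Return the raw bytes of the member identified by ipath."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def index_ipath(position):
    """Alternative for formats without member names: a printable integer."""
    return "%d" % position
```

Both kinds of identifier are plain text, readable in a log file, and sufficient for the handler to find the document again given the file name.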
execm handlers will also produce
          a document with a null ipath
          element. Depending on the type of document, this may have
          some associated data (e.g. the body of an email message), or
          none (typical for an archive file). If it is empty, this
          document will be useful anyway for some operations, as the
          parent of the actual data documents.
There are two elements that link a file to the handler which should process it: the association of file to MIME type and the association of a MIME type with a handler.
The association of files to MIME types is mostly based on
        name suffixes. The types are defined inside the
        
        mimemap file. Example:
.doc = application/msword
If no suffix association is found for the file name, Recoll will try to execute the file -i command to determine a MIME type.
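The two-step lookup can be sketched as follows. This is an illustration, not Recoll's implementation: the MIMEMAP dictionary stands in for the mimemap file (only the .doc entry comes from the example above), the function names are invented, and the -b option is added to file -i only to make the output easier to parse.

```python
import os
import subprocess

# Stand-in for the mimemap file; only the .doc entry comes from the manual.
MIMEMAP = {".doc": "application/msword"}


def mime_from_suffix(filename, mimemap=MIMEMAP):
    """Return the MIME type mapped to the file name suffix, or None."""
    suffix = os.path.splitext(filename)[1]
    return mimemap.get(suffix)


def parse_file_output(output):
    """Keep the type from 'type/subtype; charset=...' output, or None."""
    mime = output.split(";")[0].strip()
    return mime if "/" in mime else None


def mime_from_file_command(filename):
    """Ask 'file -b -i' for a MIME type; None if the command fails."""
    try:
        result = subprocess.run(
            ["file", "-b", "-i", filename],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_file_output(result.stdout)


def guess_mime(filename):
    """Suffix association first, then the file command as a fallback."""
    return mime_from_suffix(filename) or mime_from_file_command(filename)
```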
The association of file types to handlers is performed in
      the 
      mimeconf file. A sample will probably be
      of better help than a long explanation:
[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
     mimetype = text/plain ; charset=utf-8
application/ogg = exec rclogg
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
application/x-chm = execm rclchm
The fragment specifies that:
application/msword files
            are processed by executing the antiword
            program, which outputs
            text/plain encoded in
            utf-8.
application/ogg files are
            processed by the rclogg script, with
            default output type (text/html, with
            encoding specified in the header, or utf-8
            by default).
text/rtf is processed by
            unrtf, which outputs
            text/html. The 
            iso-8859-1 encoding is specified because it
            is not the utf-8 default, and not output by
            unrtf in the HTML header section.
application/x-chm is processed by a persistent handler. This is determined by the execm keyword.
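To make the structure of these definitions explicit, the following sketch splits one value from the fragment into the handler kind, the command line, and the output attributes. It is not Recoll's configuration parser, and its splitting rules (semicolon-separated attributes, backslash line continuation) are assumptions drawn only from the sample above.

```python
import shlex


def parse_handler_def(value):
    """Split a mimeconf-style value into (kind, argv, attributes)."""
    # Join continuation lines ending with a backslash.
    value = value.replace("\\\n", " ")
    command, *attr_parts = [part.strip() for part in value.split(";")]
    # The first word is the handler kind (exec or execm), the rest is the command.
    kind, *argv = shlex.split(command)
    attrs = {}
    for part in attr_parts:
        if part:
            name, _, val = part.partition("=")
            attrs[name.strip()] = val.strip()
    return kind, argv, attrs
```

Applied to the text/rtf line, this returns the kind exec, the command unrtf --nopict --html, and the two attributes charset and mimetype.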
The output HTML could be very minimal like the following example:
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
  </head>
  <body>
   Some text content
  </body>
</html>
          
You should take care to escape some characters inside the text by transforming them into appropriate entities. At the very minimum, "&" should be transformed into "&amp;", and "<" should be transformed into "&lt;". This is not always properly done by translating programs which output HTML, and of course never by those which output plain text.
When encapsulating plain text in an HTML body, 
          the display of a preview may be improved by enclosing the
          text inside <pre> tags.
The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.
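For a handler written in Python, the three points above (entity escaping, the pre wrapper, and the character set declaration) might be combined as in this sketch. The function name text_to_html is invented for the example, and html.escape comes from the Python standard library.

```python
import html


def text_to_html(text, charset="UTF-8"):
    """Wrap plain text in a minimal HTML document suitable for indexing."""
    # html.escape turns & into &amp;, < into &lt; and > into &gt;.
    body = html.escape(text, quote=False)
    return (
        "<html>\n"
        "  <head>\n"
        '    <meta http-equiv="Content-Type" content="text/html;charset=%s">\n'
        "  </head>\n"
        "  <body>\n"
        "<pre>%s</pre>\n"
        "  </body>\n"
        "</html>\n"
    ) % (charset, body)
```

The charset argument must describe the encoding in which the handler actually writes its output.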
Recoll will process meta tags inside the header as possible document field candidates. Document fields can be processed by the indexer in different ways, for searching or displaying inside query results. This is described in a following section.
        
By default, the indexer will process the standard header fields if they are present: title, meta/description, and meta/keywords are all indexed and stored for query-time display.
A predefined non-standard meta tag
          will also be processed by Recoll without further
          configuration: if a date tag is present
          and has the right format, it will be used as the document
          date (for display and sorting), in preference to the file
          modification date. The date format should be as follows:
          
<meta name="date" content="YYYY-mm-dd HH:MM:SS">
or
<meta name="date" content="YYYY-mm-ddTHH:MM:SS">
          Example:
<meta name="date" content="2013-02-24 17:50:00">
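A Python handler could produce this tag with strftime, as in the following sketch (the function name date_meta is invented for the example):

```python
from datetime import datetime


def date_meta(moment):
    """Format a datetime as the date meta tag, in the first form shown above."""
    return '<meta name="date" content="%s">' % moment.strftime("%Y-%m-%d %H:%M:%S")
```

Calling date_meta(datetime(2013, 2, 24, 17, 50, 0)) reproduces the example tag.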
          
Input handlers also have the possibility to "invent" field names. These should also be output as meta tags:
<meta name="somefield" content="Some textual data" />
You can embed HTML markup inside the content of custom
        fields, for improving the display inside result lists. In this
        case, add a (wildly non-standard) markup
        attribute to tell Recoll that the value is HTML and should not
        be escaped for display.
<meta name="somefield" markup="html" content="Some <i>textual</i> data" />
As written above, the processing of fields is described in a further section.
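The two custom field forms shown earlier could be generated by a small helper such as the one sketched here. The name field_meta is invented for the example. Escaping only the double quote in HTML-valued content follows the sample tag, which keeps the i element literal. How Recoll decodes entities inside attribute values is not covered here, so treat that choice as an assumption.

```python
import html


def field_meta(name, value, is_html=False):
    """Build a meta tag for a custom field, optionally flagged as HTML."""
    if is_html:
        # Keep the markup literal; only protect the attribute delimiter.
        content = value.replace('"', "&quot;")
        markup = ' markup="html"'
    else:
        # Plain text: escape &, <, > and quotes.
        content = html.escape(value, quote=True)
        markup = ""
    return '<meta name="%s"%s content="%s" />' % (html.escape(name), markup, content)
```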