Indexing is the process by which the set of documents is analyzed and the data entered into the database. Recoll indexing is normally incremental: documents will only be processed if they have been modified since the last run. On the first execution, all documents will need processing. A full index build can be forced later by specifying an option to the indexing command (recollindex -z or -Z).
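For illustration, typical invocations look like this (-z resets the index before the run, -Z forces re-reading all documents without first erasing the index):
recollindex
recollindex -z
recollindex -Z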
recollindex skips files which caused an error during a previous pass. This is a performance optimization, and a new behaviour in version 1.21 (failed files were always retried by previous versions). The -k command line option can be used to retry failed files, for example after updating a filter.
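For example, after installing a previously missing helper program, a retry of the failed files could be forced as follows:
recollindex -k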
The following sections give an overview of different aspects of the indexing processes and configuration, with links to detailed sections.
Depending on your data, temporary files may be needed during indexing, some of them possibly quite big. You can use the RECOLL_TMPDIR or TMPDIR environment variables to choose where they are created (the default is to use /tmp). Using TMPDIR has the nice property that it may also be taken into account by auxiliary commands executed by recollindex.
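For example, to place the temporary files on a file system with more free space for one indexing run (the directory path is just an illustration):
TMPDIR=/data/tmp recollindex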
Recoll indexing can be performed in two different modes:
Periodic (or batch) indexing: indexing takes place at discrete times, by executing the recollindex command. The typical usage is to have a nightly indexing run programmed into your cron file (see the example crontab entry after this list).
Real time indexing: indexing takes place as soon as a file is created or changed. recollindex runs as a daemon and uses a file system alteration monitor such as inotify, Fam or Gamin to detect file changes.
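As a sketch of the periodic approach, the following crontab entry would start an indexing pass every night at 03:15 (depending on your system, you may need to give the full path to recollindex):
15 3 * * * recollindex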
The choice between the two methods is mostly a matter of preference, and they can be combined by setting up multiple indexes (e.g., use periodic indexing on a big documentation directory, and real time indexing on a small home directory). Monitoring a big file system tree can consume significant system resources.
The choice of method and the parameters used can be configured from the indexing preferences in the recoll GUI.
The parameters describing what is to be indexed and local preferences are defined in text files contained in a configuration directory.
All parameters have defaults, defined in system-wide files.
Without further configuration, Recoll will index all appropriate files from your home directory, with a reasonable set of defaults.
A default personal configuration directory
	  ($HOME/.recoll/) is created
	  when a Recoll program is first executed. It is possible to
	  create other configuration directories, and use them by
	  setting the RECOLL_CONFDIR environment
	  variable, or giving the -c option to any of
	  the Recoll commands.
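For illustration, the following would create and index an alternate configuration (the ~/.recoll-work directory name is just an example; setting RECOLL_CONFDIR to the same path would have the same effect as the -c option):
mkdir ~/.recoll-work
recollindex -c ~/.recoll-work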
In some cases, it may be useful to index different areas of the file system to separate databases. You can do this by using multiple configuration directories, each indexing a file system area to a specific database. Typically, this would be done to separate personal and shared indexes, or to take advantage of the organization of your data to improve search precision.
The generated indexes can be queried concurrently in a transparent manner.
For index generation, multiple configurations are totally independent of each other. When multiple indexes need to be used for a single search, some parameters should be consistent among the configurations.
Recoll knows about quite a few different document types. The parameters for document type recognition and processing are set in configuration files.
Most file types, like HTML or word processing files, only hold one document. Some file types, like email folders or zip archives, can hold many individually indexed documents, which may themselves be compound ones. Such hierarchies can go quite deep, and Recoll can process, for example, a LibreOffice document stored as an attachment to an email message inside an email folder archived in a zip file...
Recoll indexing processes plain text, HTML, OpenDocument (Open/LibreOffice), email formats, and a few others internally.
Other file types (e.g., PostScript, PDF, MS Word, RTF ...) need external applications for preprocessing. The list is in the installation section. After every indexing operation, Recoll updates a list of commands that would be needed for indexing the existing file types. This list can be displayed from the recoll GUI menus, and it is stored in the missing text file inside the configuration directory.
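For instance, assuming the default personal configuration directory, the same list can be inspected from the command line:
cat ~/.recoll/missing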
By default, Recoll will try to index any file type that it has a way to read. This is sometimes not desirable, and there are ways to either exclude some types, or on the contrary to define a positive list of types to be indexed. In the latter case, any type not in the list will be ignored.
Excluding types can be done by adding wildcard name patterns to the skippedNames list, which can be edited from the GUI index configuration panel. For versions 1.20 and later, you can alternatively set the excludedmimetypes list in the configuration file. This can be redefined for subdirectories.
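As an illustration, the following line in recoll.conf would exclude MP3 audio and BMP image files from the index (the MIME types are arbitrary examples):
excludedmimetypes = audio/mpeg image/bmp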
You can also define an exclusive list of MIME types to be indexed (no others will be indexed), by setting the indexedmimetypes configuration variable. Example:
            the indexedmimetypes configuration
            variable. Example:
indexedmimetypes = text/html application/pdf
          It is possible to redefine this parameter for subdirectories. Example:
[/path/to/my/dir]
indexedmimetypes = application/pdf
          (When using sections like this, don't forget that they remain in effect until the end of the file or another section indicator).
Both excludedmimetypes and indexedmimetypes can be set either by editing the main configuration file (recoll.conf), or from the GUI index configuration tool.
Indexing may fail for some documents, for a number of reasons: a helper program may be missing, the document may be corrupt, we may fail to uncompress a file because no file system space is available, etc.
Recoll versions prior to 1.21 always retried indexing files which had previously caused an error. This guaranteed that anything that may have become indexable (for example because a helper had been installed) would be indexed. However, this was bad for performance because some indexing failures may be quite costly (for example failing to uncompress a big file because of insufficient disk space).
The indexer in Recoll versions 1.21 and later does not retry failed files by default. Retrying will only occur if an explicit option (-k) is set on the recollindex command line, or if a script executed when recollindex starts up says so. The script is defined by a configuration variable (checkneedretryindexscript), and makes a rather lame attempt at deciding if a helper command may have been installed, by checking if any of the common bin directories have changed.
In the rare case where the index becomes corrupted (which can show up as weird search results or crashes), the index files need to be erased before restarting a clean indexing pass. Just delete the xapiandb directory (see the next section), or, alternatively, start the next recollindex with the -z option, which will reset the database before indexing.
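For example, assuming the default configuration directory location, erasing the index files and starting a clean rebuild could be done as follows:
rm -rf ~/.recoll/xapiandb
recollindex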