On indexing org files
Since indexing org files is rather en vogue lately, I thought I should present my own home-brewed solution. First I tried recoll, but unfortunately I was unable to compile it on OSX. Because I store all of my notes in one huge org file, mdfind wasn't a solution either. John Kitchin provided a lot of inspiration, but somehow it didn't quite fit. So I had to hack something together myself...
(defpackage :index-cli
  (:use :cl :net.didierverna.clon)
  (:import-from :montezuma :add-field :make-field)
  (:export #:main))

(in-package :index-cli)

(defvar *version* "0.1")
(defvar *index-path-name* ".org-file-index")
(defparameter *index-path* (merge-pathnames *index-path-name*))

(defun make-index (&key create-p)
  (make-instance 'montezuma:index
                 :path *index-path*
                 :create-p create-p
                 :analyzer (make-instance 'montezuma:standard-analyzer)
                 :default-field "*"))

(defun search-index (query)
  (let ((index (make-index :create-p nil)))
    (when (= 0 (montezuma:search-each
                index query
                #'(lambda (doc-id score)
                    (declare (ignore score))
                    (format t "~A [~A|~A]~%"
                            (montezuma:document-value (montezuma:get-document index doc-id) 'title)
                            (montezuma:document-value (montezuma:get-document index doc-id) 'file)
                            (montezuma:document-value (montezuma:get-document index doc-id) 'buffer-position)))
                '(:num-docs 100000)))
      (format t "Not found~%"))))

(defun index-file (filespec)
  (let ((all-docs (with-open-file (f filespec) (read f)))
        (index (make-index :create-p t)))
    (dolist (doc all-docs)
      (montezuma:add-document-to-index
       index doc (make-instance 'montezuma:standard-analyzer)))
    (montezuma:optimize index)
    (montezuma:close index)))

;;; CLI
(defsynopsis (:postfix "FILE")
  (text :contents "Index alist file")
  (path :long-name "index-path" :short-name "i" :type :directory
        :description "Index location. Defaults to current directory.")
  (stropt :long-name "query" :short-name "q"
          :description "Search terms.")
  (flag :long-name "help" :short-name "h"
        :description "Show this help text.")
  (flag :long-name "version" :short-name "v"
        :description "Show the version."))

(defun main (args)
  (setf sb-impl::*default-external-format* :utf-8)
  (make-context :cmdline args)
  (when (getopt :short-name "h")
    (help)
    (sb-ext:exit))
  (when (getopt :short-name "v")
    (format t "index-cli v~A~%" *version*)
    (sb-ext:exit))
  (let ((index-path (getopt :short-name "i"))
        (query (getopt :short-name "q")))
    (when index-path
      (setf *index-path* (merge-pathnames *index-path-name* index-path)))
    (if query
        (search-index query)
        (index-file (merge-pathnames (car (remainder)))))))
The most important parts are the index-file and search-index functions. The file passed to index-file should contain a list of alists. The function iterates over this list and passes each alist to Montezuma. Simple as that. Each alist contains the following data:
- title: Usually the org headline.
- contents: A paragraph's text, meaning the text between two org headlines.
- file: The absolute path to the file from which the title or contents has been extracted.
- buffer-position: The position of the headline or paragraph in the buffer. Knowing that, it is easy to jump directly to the occurrence of the search term. This will be important later.
((title . "An awesome title")
 (contents . "Much much content...")
 (file . "/Users/lispm/org/kb.org")
 (buffer-position . 311518))
Each alist is a document in Lucene/Montezuma lingo. Note that contents is actually optional, because I index headlines and paragraphs independently. When I index a paragraph, though, I put its parent headline into the document as well. Admittedly this indexes the same headline multiple times, which is unnecessary, but I'm fine with that.
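To make the shape of the documents file a bit more concrete, here is a minimal Emacs Lisp sketch that reads such a list and looks up fields. The sample values (including the position 311400 of the headline-only document) and all `my/` names are made up for illustration; the actual indexer reads the file with Common Lisp's `read` inside index-file.

```elisp
;; Hypothetical sample of a documents file: one list holding two alists.
;; The first document is a headline only (no `contents' key); the second
;; is a paragraph that carries its parent headline as `title'.
(defvar my/sample-documents-text
  "(((title . \"An awesome title\")
     (file . \"/Users/lispm/org/kb.org\")
     (buffer-position . 311400))
    ((title . \"An awesome title\")
     (contents . \"Much much content...\")
     (file . \"/Users/lispm/org/kb.org\")
     (buffer-position . 311518)))")

;; `read-from-string' returns (OBJECT . END-POSITION); the car is the list.
(defvar my/sample-documents
  (car (read-from-string my/sample-documents-text)))

(defun my/document-field (key document)
  "Return the value stored under KEY in the alist DOCUMENT, or nil."
  (cdr (assq key document)))

(my/document-field 'contents (car my/sample-documents))   ; => nil
(my/document-field 'contents (cadr my/sample-documents))  ; => "Much much content..."
```

Both documents share the same title, which is the duplication mentioned above: the headline is indexed once on its own and once more with every paragraph below it.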
The search-index function applies the given query to the index and is quickly explained. If documents matching the query are found, each one is printed to standard output. For example, querying for awesome title prints the above document like this:
An awesome title [/Users/lispm/org/kb.org|311518]
This output format also applies when the search term was found inside the contents of a document, which is why I index the headline along with each paragraph. To finalize the indexer, I built an executable with buildapp and gave it a primitive CLI. The -i option tells it where the index should be stored. The -q option searches the index. Omitting -q and passing the documents file (the list of alists) indexes that file.
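Since every result line has the shape `title [file|position]`, it can be split back into its parts with a single regexp. The following is a small Emacs Lisp sketch of that idea; `my/parse-index-line` is a hypothetical helper name, and the regexp is the same one used in the swiper integration later in this post.

```elisp
(defun my/parse-index-line (line)
  "Return (FILE . POSITION) parsed from LINE, or nil if LINE does not match.
LINE is expected to look like \"TITLE [FILE|POSITION]\"."
  (when (string-match ".*\\[\\(.*\\)|\\([[:digit:]]+\\)\\]" line)
    (cons (match-string 1 line)
          (string-to-number (match-string 2 line)))))

(my/parse-index-line "An awesome title [/Users/lispm/org/kb.org|311518]")
;; => ("/Users/lispm/org/kb.org" . 311518)

(my/parse-index-line "Not found")
;; => nil
```

Returning nil for anything that does not match means the "Not found" line printed by search-index is simply ignored.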
The documents creator
Next I needed a script that transforms my org file into a list of indexable documents.
;;; use the latest org
(add-to-list 'load-path "/Users/fyi/.emacs.d/elpa/org-20150727/")
(require 'org)
(require 'xml)
(require 'json)

(defun parent-headline (element)
  (let ((parent (org-element-property :parent element)))
    (cond ((null parent) "")
          ((and (eq 'headline (org-element-type parent))
                (eql 1 (org-element-property :level parent)))
           (org-element-property :raw-value parent))
          (t (parent-headline parent)))))

(find-file (expand-file-name (car command-line-args-left)))

(princ
 (append
  (org-map-entries
   (lambda ()
     (let* ((headline (org-element-at-point))
            (title (org-element-property :title headline)))
       (if (equal title "Website Summary:")
           ""
         (format "((title . %S) (file . %S) (buffer-position . %s))"
                 title
                 (buffer-file-name)
                 (prin1-to-string (org-element-property :begin headline)))))))
  (org-element-map (org-element-parse-buffer) 'paragraph
    (lambda (paragraph)
      (format "((title . %S) (contents . %S) (file . %S) (buffer-position . %s))"
              (parent-headline paragraph)
              (xml-escape-string
               (buffer-substring-no-properties
                (org-element-property :contents-begin paragraph)
                (org-element-property :contents-end paragraph)))
              (buffer-file-name)
              (prin1-to-string (org-element-property :begin paragraph)))))))
The above script takes an org file, maps over each headline and paragraph, and produces an alist for each one. Nothing special about it. I didn't know about the org-element API before, and it turns out to be quite nice.
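For readers who haven't used the org-element API either, here is a stripped-down sketch of the mapping step. It runs on a string in a temporary buffer instead of a real file, and `my/headline-positions` is a hypothetical name that does not appear in the script above.

```elisp
(require 'org)
(require 'org-element)

(defun my/headline-positions (org-text)
  "Return a list of (TITLE . BEGIN) pairs for every headline in ORG-TEXT."
  (with-temp-buffer
    (insert org-text)
    (org-mode)
    ;; Parse the whole buffer and visit every element of type `headline'.
    (org-element-map (org-element-parse-buffer) 'headline
      (lambda (headline)
        (cons (org-element-property :raw-value headline)
              (org-element-property :begin headline))))))

(my/headline-positions "* First\nSome text.\n** Nested\nMore text.\n")
```

The call returns one pair per headline, in buffer order, with the headline text and the buffer position where it starts. That position is the same kind of value the script stores as buffer-position.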
Lastly, everything gets integrated with swiper so that it is usable from inside Emacs.
(defun counsel-kb-search-function (string &rest unused)
  "Lookup STRING with index-cli"
  (if (< (length string) 3)
      (counsel-more-chars 3)
    (counsel--async-command
     (format "index-cli -i /Users/lispm/org -q \"%s\"" string))
    nil))

(defun search-kb (&optional initial-input)
  "Search KB"
  (interactive)
  (ivy-read "search KB: " 'counsel-kb-search-function
            :initial-input initial-input
            :dynamic-collection t
            :history 'counsel-git-grep-history
            :action (lambda (x)
                      (when (string-match ".*\\[\\(.*\\)|\\([[:digit:]]+\\)\\]" x)
                        (let ((file-name (match-string 1 x))
                              (point (string-to-number (match-string 2 x))))
                          (find-file file-name)
                          (goto-char point)
                          (org-show-entry)
                          (show-children))))))
The regexp matches the output of the indexer. Because I've indexed the file path and the buffer position, I can jump directly to the search term from swiper, which is exactly what I wanted. See a demo below: