12. Parsing

Sphinx uses docutils parsers to deal with the normal reST input text. In O-O terms, it utilizes polymorphic state machines, choosing parsers instantiated from the right parser sub-class based on the portion of the document being processed.

Inside each state-machine method, self refers to the parser state object. That object normally has a document attribute: the tree of docutils node objects (an abstract syntax tree) that accumulates the content of the .rst file being parsed. In fact, most objects involved have a document attribute.

Also, other types of objects (e.g. domain objects) that don’t have a document attribute come with an env (environment) attribute, which itself has an attribute called current_document which is a reference to the same doctree (tree of docutils node objects).

How does it know where to insert nodes? Parsing goes from top to bottom, so the insertion point is always “at the end”: nodes are added with an append() call (or the equivalent += operator). These methods are defined by the Element class (a high-level ancestor of most node classes in the inheritance hierarchy), and they add the node or list of nodes passed in to the element's children attribute, which is a list object.

Under the debugger, how can you tell where it is parsing? The document attribute has its own attribute current_line which has the line number, and when parsing is taking place, sometimes there is an argument context which provides the text from that line.

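check_line() is the state machine method that deals with the contents of the current line: it tests the line against the current state's transition patterns and calls the method of the first transition that matches.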

How does it know what node to append to? The target object self has an attribute parent, which holds the node whose list of child nodes is being appended to. Have a look at this example: the text() method of the Text state class, which parses an identified paragraph. (Text is a subclass of RSTState; both are defined in docutils\parsers\rst\states.py.)

def text(self, match, context, next_state):
    """Paragraph."""
    startline = self.state_machine.abs_line_number() - 1
    msg = None
    try:
        block = self.state_machine.get_text_block(flush_left=True)
    except statemachine.UnexpectedIndentationError as err:
        block, src, srcline = err.args
        msg = self.reporter.error('Unexpected indentation.',
                                  source=src, line=srcline)
    lines = context + list(block)
    paragraph, literalnext = self.paragraph(lines, startline)
    self.parent += paragraph
    self.parent += msg
    if literalnext:
        try:
            self.state_machine.next_line()
        except EOFError:
            pass
        self.parent += self.literal_block()
    return [], next_state, []

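The context argument contains the first line of the paragraph. self.state_machine.abs_line_number() returns the one-based number of the line the state machine is currently on, which at this point is the line after the one held in context. Thus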

startline = self.state_machine.abs_line_number() - 1

sets startline to the one-based document line number of the first line of the paragraph. Then
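The arithmetic can be checked with a small stand-in for the state machine. This is only a sketch, not docutils code: ToyStateMachine is a hypothetical class that mimics the line bookkeeping, assuming abs_line_number() is the zero-based position of the current line plus one.

```python
class ToyStateMachine:
    """Hypothetical stand-in that mimics only the line bookkeeping."""

    def __init__(self, lines, input_offset=0):
        self.input_lines = lines
        self.input_offset = input_offset  # lines preceding this input in the file
        self.line_offset = -1             # zero-based index of the current line

    def next_line(self):
        """Advance to the next line and return its text."""
        self.line_offset += 1
        return self.input_lines[self.line_offset]

    def abs_line_number(self):
        """One-based document line number of the current line."""
        return self.line_offset + self.input_offset + 1


source = ["Title", "=====", "", "First line of paragraph,", "second line.", ""]
machine = ToyStateMachine(source)

# Advance until the machine sits on the paragraph's second line; by then
# the first line has already been handed over as `context`.
for _ in range(5):
    current = machine.next_line()
context = [source[3]]

startline = machine.abs_line_number() - 1
print(startline, context[0])
```

Here the machine is on line 5 of the toy document, so startline comes out as 4, the one-based number of the line held in context.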

block = self.state_machine.get_text_block(flush_left=True)

slurps up the rest of the paragraph into the block object, which is a StringList. Its data attribute is the list of line strings, and its items attribute is a parallel list of tuples: items[0][0] contains the full path to the source document and items[0][1] contains the line position recorded for the block's first line within that document (the same value as startline mentioned above). self.parent is the section node this paragraph is being added to. The paragraph ends at the first blank line or at the end of the document, whichever comes first.
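The block-gathering step can be pictured with a few lines of plain Python. This is a simplified sketch, not the docutils implementation: get_text_block here is a hypothetical free function working on a plain list of strings, and it leaves out the flush_left check that raises UnexpectedIndentationError.

```python
def get_text_block(lines, start):
    """Return the contiguous non-blank lines beginning at index `start`."""
    block = []
    for line in lines[start:]:
        if not line.strip():   # a blank line ends the paragraph
            break
        block.append(line)
    return block


source = [
    "First line of paragraph,",
    "second line,",
    "third line.",
    "",
    "Next paragraph.",
]
context = [source[0]]               # first line, already consumed
block = get_text_block(source, 1)   # the rest of the paragraph
lines = context + list(block)       # same combination as in text()
print(lines)
```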

Then lines is a list created from all lines of the paragraph (context plus block), which is then sent to

paragraph, literalnext = self.paragraph(lines, startline)

to be converted to a list of nodes in paragraph, plus literalnext, which is a Boolean value indicating whether an unescaped :: was found at the end of the paragraph, meaning that a literal block follows.
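How a trailing :: marker can be detected is sketched below. The helper name ends_with_literal_marker is hypothetical, and the pattern is an assumption about the kind of test involved: it accepts :: at the very end of the text unless it is escaped by an odd number of backslashes.

```python
import re

# "::" at the very end of the text, preceded by an even number of
# backslashes (zero included), so that an escaped "\::" does not count.
LITERAL_MARKER = re.compile(r'(?<!\\)(\\\\)*::$')


def ends_with_literal_marker(lines):
    """Return True if the joined paragraph text ends with an unescaped '::'."""
    text = "\n".join(lines).rstrip()
    return LITERAL_MARKER.search(text) is not None


print(ends_with_literal_marker(["An example follows::"]))
print(ends_with_literal_marker(["Just a plain paragraph."]))
print(ends_with_literal_marker(["The marker is escaped here\\::"]))
```

The first call reports a literal block marker; the other two do not.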

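Note that the paragraph variable is populated with a list: the paragraph node, whose children are the text and in-line element nodes, followed by any system messages produced while parsing the in-line markup. It is created by this call: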

paragraph, literalnext = self.paragraph(lines, startline)

Then that node list is appended to the target (section) node in these lines:

self.parent += paragraph
self.parent += msg

paragraph is a list of Node objects (polymorphic). msg is either None, in which case the += adds nothing, or the system message node returned by self.reporter.error(), which carries a human-readable message about the parsing error.
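The behaviour of += on a parent node can be sketched with a toy class. ToyNode is hypothetical and much simpler than the docutils node classes; it only assumes that += accepts a single node, a list of nodes, or None.

```python
class ToyNode:
    """Hypothetical minimal node: a tag name plus a list of children."""

    def __init__(self, tagname):
        self.tagname = tagname
        self.children = []
        self.parent = None

    def append(self, node):
        """Add one child at the end and record its parent."""
        node.parent = self
        self.children.append(node)

    def extend(self, nodes):
        """Add several children, in order."""
        for node in nodes:
            self.append(node)

    def __iadd__(self, other):
        """Accept a single node, a list of nodes, or None (ignored)."""
        if isinstance(other, ToyNode):
            self.append(other)
        elif other is not None:
            self.extend(other)
        return self


section = ToyNode("section")
paragraph = [ToyNode("paragraph")]     # a list, as returned by paragraph()
msg = None                             # no parsing error occurred

section += paragraph
section += msg                         # nothing is added
section += ToyNode("system_message")   # a single node also works
print([child.tagname for child in section.children])
```

Because every addition goes through append(), new nodes always land at the end of children, which matches the top-to-bottom order of parsing.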

12.1. Definition Lists

Sphinx outputs HTML Definition Lists for a number of different types of document structures. It is important to understand what to expect so that custom formatting of these structures is more straightforward.

The different types of document structures are:

Any of the above can also have the class “simple” added to the list's classes when none of its <dd> elements contain nested formatting such as sub-lists, tables, etc.

Definition lists look like this in HTML:

Listing 12.1.1 A Plain Definition List
<dl>
    <dt>term1</dt>
    <dd>words of definition 1</dd>
    <dt>term2</dt>
    <dd>words of definition 2</dd>
    ...
</dl>

Because all <dd> elements are “simple”, Sphinx would add class “simple” to the <dl> element like this: <dl class="simple">.

The default browser formatting for the above looks like this:

term1
words of definition 1
term2
words of definition 2
...

12.1.1. Furo Theme

The Furo theme may or may not be alone in this, but I can say that it takes the structures above and applies CSS formatting to them as follows:

  • All <dl> elements that have a class attribute but DO NOT have any of the following classes are formatted first.

    • option-list

    • field-list

    • footnote

    • glossary

    • simple

You can find them by looking for (regex) \bdd, and they look like this (but without the comments):

/* All `dd` descendants (at any level) */
#furo-main-content dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) dd {
  margin-left:2rem
}

/* First direct-child elements of `dd` descendants (at any level) */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) dd>:first-child {
  margin-top:.125rem
}

/* Last direct-child elements of `dd` descendants (at any level),
 * and also descendant "field-list" classes (Sphinx only applies "field-list" class to <dl> elements). */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list,
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) dd>:last-child {
  margin-bottom:.75rem
}

/* Direct-child <dt> elements of descendant elements with the "field-list" class. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list>dt {
  font-size:var(--font-size--small);
  text-transform:uppercase
}

/* Empty <dd> elements inside descendant elements with the "field-list" class. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list dd:empty {
  margin-bottom:.5rem
}

/* <ul> elements that are direct children of <dd> elements inside descendant "field-list" elements. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list dd>ul {
  margin-left:-1.2rem
}

/* <p> elements that are the second child of an <li> in those <ul> lists. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list dd>ul>li>p:nth-child(2) {
  margin-top:0
}

/* Empty trailing <p> elements that directly follow another <p> in those <li> items. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple) .field-list dd>ul>li>p+p:last-child:empty {
  margin-bottom:0;
  margin-top:0
}

/* Direct-child <dt> elements of the matched <dl> itself. */
dl[class]:not(.option-list):not(.field-list):not(.footnote):not(.glossary):not(.simple)>dt {
  color:var(--color-api-overall)
}
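Which <dl> elements the shared selector prefix applies to can be expressed as a small predicate. This is an illustrative sketch only: matches_generic_dl is a hypothetical helper, and the class strings passed to it are examples rather than an exhaustive list of what Sphinx emits.

```python
EXCLUDED = {"option-list", "field-list", "footnote", "glossary", "simple"}


def matches_generic_dl(class_attr):
    """Mimic dl[class]:not(.option-list):not(.field-list)... for one <dl>.

    class_attr is the raw value of the class attribute, or None when the
    attribute is absent.
    """
    if class_attr is None:   # dl[class] requires the attribute to be present
        return False
    # Every :not(...) must hold, so no excluded class may appear.
    return EXCLUDED.isdisjoint(class_attr.split())


print(matches_generic_dl("py function"))   # an API description list
print(matches_generic_dl("simple"))        # excluded by :not(.simple)
print(matches_generic_dl(None))            # a bare <dl> with no class
```

A <dl> with no class attribute at all is not matched, so a hand-written plain definition list falls back to the browser defaults unless other rules in the theme cover it.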