5.1. On-disk layout

Everything the TOA server keeps for itself lives under toa-server.dataRoot (see Server settings). The layout is plain folders and files - no database - so an operator with shell access can answer support questions and do manual cleanup with ordinary tools.

5.1.1. Top-level structure

<dataRoot>/
  <domain.code>/                  one folder per configured domain
    cmserver2.xml                 cached templates (URL-based domains only)
    yyyy-MM-dd/                   one folder per server-local calendar day
      HHmm-xxxxxxxx/              one folder per import (HHmm + 8 hex chars)
        import.json
        doc-1/
          pages/
            page-1.bin
            page-1.meta.json      EML pages only
            page-2.bin
            ...
        doc-2/                    sibling document, see below
          pages/
            page-1.bin
            ...

The per-domain folder is created on startup. The yyyy-MM-dd folder is created the first time an import lands on that calendar day. The HHmm-xxxxxxxx folder is created when the add-in calls POST /import/<domain>. The server writes nothing under <dataRoot> outside this structure.

The yyyy-MM-dd and HHmm parts use the server-local time zone. The same zone governs the retention cutoff, so a misconfigured zone has visible consequences both here and in Retention and cleanup. Pin the JVM time zone explicitly on every TOA server instance - see Time zone.
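To illustrate why the zone matters, the following Python sketch derives the two folder-name parts from a single instant under two different zones. It is not the server's code; the function name folder_names is hypothetical, and fixed UTC offsets stand in for a real zone to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def folder_names(instant: datetime, zone: tzinfo) -> tuple[str, str]:
    """Return the (yyyy-MM-dd, HHmm) parts for a timezone-aware instant as seen in `zone`."""
    local = instant.astimezone(zone)
    return local.strftime("%Y-%m-%d"), local.strftime("%H%M")


# One and the same instant, late in the evening UTC.
instant = datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc)

print(folder_names(instant, timezone.utc))                  # ('2024-03-05', '2330')
print(folder_names(instant, timezone(timedelta(hours=1))))  # ('2024-03-06', '0030')
```

An import created at that instant would land in a different date folder depending on the zone the JVM runs in, which is also the zone the retention cutoff is computed in.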

5.1.2. Import identifier

The API-level import id has the form:

yyyy-MM-dd_HHmm-xxxxxxxx

The two halves correspond directly to the date folder and the import folder on disk. Given an id, an operator can locate the import on disk without searching:

<dataRoot>/<domain>/<yyyy-MM-dd>/<HHmm-xxxxxxxx>/

The xxxxxxxx suffix is 8 random hex characters (32 bits); it makes the id hard to guess for download URLs and keeps imports unique within the same minute.
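The mapping from id to folder can be sketched in Python as follows. This is an operator-side helper, not part of the server; the function name import_path and the example values are hypothetical, and the pattern accepts both upper- and lower-case hex because the case emitted by the server is not stated here.

```python
import re
from pathlib import Path

# yyyy-MM-dd_HHmm-xxxxxxxx: group 1 is the date folder, group 2 the import folder.
IMPORT_ID = re.compile(r"(\d{4}-\d{2}-\d{2})_(\d{4}-[0-9a-fA-F]{8})")


def import_path(data_root: str, domain: str, import_id: str) -> Path:
    """Translate an API-level import id into its folder under <dataRoot>/<domain>/."""
    match = IMPORT_ID.fullmatch(import_id)
    if match is None:
        raise ValueError(f"not an import id: {import_id!r}")
    date_folder, import_folder = match.groups()
    return Path(data_root) / domain / date_folder / import_folder


# Hypothetical data root, domain code and id, for illustration only.
print(import_path("/srv/toa", "acme", "2024-03-05_0931-1a2b3c4d"))
```

Rejecting anything that does not match the pattern also keeps a mistyped id from being joined into an unrelated path.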

5.1.3. What each file means

import.json

Single source of truth for an import. Operator-relevant fields:

  • status - DRAFT, PENDING, SUBMITTED or FAILED. FAILED carries an error message; SUBMITTED carries damisBatchId for cross-referencing the storage server.

  • userName / userEmail - whoever created the import in Outlook. userEmail is also the ownership key enforced on subsequent mutations.

  • documents[] - the list of documents in this import; each entry carries its template id, attribute values and pages[] metadata (filename, byte size, content type, sidecar filename if any).

The file is rewritten atomically (write to import.json.tmp, then ATOMIC_MOVE). If you ever see import.json.tmp left over, the server crashed mid-rewrite - it is safe to delete; the previous import.json is intact.
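The write-then-move pattern described above can be illustrated with a Python analogue. The server itself runs on the JVM and uses ATOMIC_MOVE; the sketch below is not its code, the function name rewrite_json_atomically is hypothetical, and os.replace plays the role of the atomic move (atomic on POSIX when source and target are on the same filesystem).

```python
import json
import os
from pathlib import Path


def rewrite_json_atomically(target: Path, payload: dict) -> None:
    """Write payload to <target>.tmp, then move it over target in one step."""
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the move
    # Readers see either the old file or the new one, never a partial write.
    os.replace(tmp, target)
```

A crash before the final move leaves only the .tmp file behind and the previous target untouched, which is why a leftover import.json.tmp can be deleted safely.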

doc-N/

One folder per document inside the import. doc-1 always exists and corresponds to the original message the user uploaded. doc-2, doc-3, … are sibling documents created by the “extract attachments” flow - each holds the attachments split out of an EML page in doc-1 (or a later sibling).

doc-N/pages/page-M.bin

Raw page payload. The byte stream is whatever the client posted - typically an EML message for page-1 of doc-1, or a single extracted attachment for sibling-document pages. The .bin extension is intentional; the semantic content type lives in import.json and (for EML) in the sidecar.

doc-N/pages/page-M.meta.json

Sidecar produced for message/rfc822 pages only. Contains the decoded from / subject and the list of MIME attachments (filename, content type, decoded size). It is a convenience index - if the sidecar is missing or unreadable, the .bin is still the source of truth and the server falls back to re-parsing on demand. Sidecars are rewritten when attachments are extracted, so they always match the on-disk EML.

cmserver2.xml

Only present for domains whose templates are loaded from a URL (see Template catalogue). Cached copy of the last successfully downloaded catalogue; used as the fallback when the next refresh fails. Safe to delete - the next refresh re-downloads it. If you delete it while the remote URL is also unreachable, the domain has no catalogue until either the URL recovers or you drop in a copy by hand.

5.1.4. Atomicity guarantees

The layout is designed so that what an operator sees on disk is always a consistent state, never the middle of a write:

  • A page binary existing on disk implies the upload completed. Interrupted uploads leave no page-M.bin at all - never a half-written one. The controller streams the request body to a sibling .tmp file and then moves it into place with ATOMIC_MOVE.

  • import.json and the page binaries inside the same import folder are mutated under a per-import lock held by DomainStorage, so concurrent addPage / createDocumentFromAttachments / submit calls cannot interleave their writes.

  • Atomicity is per-import. Two different imports under the same date folder are independent; backing up or deleting one never affects the other.
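Since interrupted writes leave only .tmp files behind, a read-only scan can show where a crash or aborted upload occurred. The Python sketch below rests on assumptions: the function name stale_tmp_files is hypothetical, every temporary file is assumed to end in .tmp (the exact name of the page temporary is not given here), and the one-hour default age is an arbitrary threshold chosen to skip writes that may still be in progress.

```python
import time
from pathlib import Path


def stale_tmp_files(data_root: str, min_age_seconds: float = 3600.0) -> list[Path]:
    """List *.tmp files under data_root last modified more than min_age_seconds ago."""
    cutoff = time.time() - min_age_seconds
    found = []
    for path in Path(data_root).rglob("*.tmp"):
        # Skip recent files: they may belong to a write that is still running.
        if path.is_file() and path.stat().st_mtime < cutoff:
            found.append(path)
    return sorted(found)
```

The function only reports; whether to delete what it finds remains an operator decision.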

5.1.5. Manual operations

Because the layout has no database, the following are all safe shell operations, as long as the server is not actively writing to the target import:

  • Inspect an import: cat <importPath>/import.json, ls <importPath>/doc-*/pages/.

  • Archive an import: tar or zip the HHmm-xxxxxxxx folder, then remove the original. The server will not notice it is gone until the next API call references it.

  • Delete a single import: rm -rf the HHmm-xxxxxxxx folder. The corresponding API id will then return 404.

  • Bulk delete old date folders: see Retention and cleanup.

Do not rename folders or hand-edit import.json while the server is running. The per-import lock is in-process only, so the server cannot guard against external changes and may observe them mid-operation. Stop the server first.
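For the inspect operation, a read-only Python sketch can stand in for cat and ls when many imports need checking. It assumes the JSON keys are spelled exactly as the field names in What each file means (status, userEmail, documents); the function name summarize_import is hypothetical.

```python
import json
from pathlib import Path


def summarize_import(import_dir: str) -> dict:
    """Read import.json and list the page binaries actually present on disk."""
    root = Path(import_dir)
    meta = json.loads((root / "import.json").read_text(encoding="utf-8"))
    pages = sorted(
        p.relative_to(root).as_posix()
        for p in root.glob("doc-*/pages/page-*.bin")
    )
    return {
        "status": meta.get("status"),        # assumed key name
        "userEmail": meta.get("userEmail"),  # assumed key name
        "documents": len(meta.get("documents", [])),
        "pages_on_disk": pages,
    }
```

The sketch only reads, so it does not conflict with the warning above about renaming or editing while the server is running.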