Generating Audit Log for User Access

Modified on Sat, 9 May at 12:48 AM

Document Access Audit — Generate a Workspace-Aware Access Report

This article explains how to generate a workspace-aware audit report from platform nginx access logs.

The report is designed to answer questions such as:

  • Which finalised documents did each user view on a given day?
  • When did a user access the realtime feed?
  • Which users were active in a given workspace?

This process is useful when creating forensic timelines for security, compliance, or internal investigations.

The workflow has two stages and must be run in order:

  1. gather_logs.py — collects, decompresses, and filters the raw nginx logs
  2. parse_logs.py — parses the filtered log and outputs a structured .xlsx report with workspace-aware user attribution

Prerequisites

Before running the scripts, confirm the following:

  • The scripts are being run on the platform web host
  • Python 3.10 or newer is installed
  • The openpyxl Python package is installed
  • You have read access to:
    • /var/log/nginx
    • /mnt/data/private/_ws

Install dependency

pip3 install openpyxl

If the host uses an externally managed Python environment, run:

pip3 install openpyxl --break-system-packages

Permissions required

You must have read access to:

  • /var/log/nginx — contains the rotated access logs
  • /mnt/data/private/_ws — contains each workspace's users.txt file

On the platform host, running as root or a user with appropriate nginx log access is typically sufficient.


File Locations

This process assumes the following files are stored together in one directory, for example /opt/audit/:

/opt/audit/
├── gather_logs.py
├── parse_logs.py
└── RUNBOOK.md

Input files

The scripts use the following source files:

  • /var/log/nginx/access.log
  • /var/log/nginx/access.log.1
  • /var/log/nginx/access.log.2.gz
  • /var/log/nginx/access.log.3.gz
  • etc.

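They also read per-workspace user mapping files. Each users.txt sits under two shard directories (shown as <3>/<3>), which in the example below are the first three and the next three digits of the workspace ID (<wsid>):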

/mnt/data/private/_ws/<3>/<3>/<wsid>/users.txt

Example:

110027083 → /mnt/data/private/_ws/110/027/110027083/users.txt
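
The path layout above can be sketched in Python as follows. This is a minimal illustration, not the code in parse_logs.py: the function name users_txt_path is hypothetical, and the sharding rule (first three digits, then the next three) is inferred from the single example shown.

```python
from pathlib import Path


def users_txt_path(wsid: str, base: str = "/mnt/data/private/_ws") -> Path:
    """Build the users.txt path for a workspace ID.

    Assumption: the two shard directories are the first three and the
    next three digits of the ID, as in the 110027083 example.
    """
    wsid = str(wsid)
    if not wsid.isdigit() or len(wsid) < 6:
        raise ValueError(f"unexpected workspace ID: {wsid!r}")
    return Path(base) / wsid[:3] / wsid[3:6] / wsid / "users.txt"


print(users_txt_path("110027083"))
```

If a workspace reports a missing users.txt, building the path this way and checking it with ls is a quick way to confirm whether the file is really absent.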

Intermediate output

Stage 1 generates:

  • access.log.docviews

This is the filtered log file consumed by Stage 2.

Final output

Stage 2 generates:

  • report.xlsx

This workbook includes three sheets:

  • Raw Data
  • User Summary
  • Users

Process Summary

The workflow is:

  1. Read all access.log* files from /var/log/nginx
  2. Decompress rotated .gz logs on the fly
  3. Filter by date range
  4. Exclude known noise
  5. Keep only relevant document/realtime access entries
  6. Write the filtered results to access.log.docviews
  7. Parse the filtered file
  8. Backfill missing user IDs and workspaces where possible
  9. Resolve user emails from workspace users.txt
  10. Output the final audit workbook as report.xlsx

Step 1 — Gather Logs

Use gather_logs.py to collect and filter the nginx access logs.

This script:

  • walks /var/log/nginx
  • decompresses .gz files where needed
  • filters by date range
  • excludes noisy or irrelevant entries
  • keeps only document-view or realtime-related endpoints
  • writes the filtered output to a single file
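
For readers who want to see how discovery and on-the-fly decompression might work, here is a minimal sketch. It is an assumption about the approach, not an excerpt from gather_logs.py; the names LOG_NAME_RE, discover_logs, and iter_log_lines are hypothetical.

```python
import gzip
import re
from pathlib import Path
from typing import Iterator, List

# Matches access.log, access.log.1, access.log.2.gz, and so on, but not
# derived files such as access.log.docviews.
LOG_NAME_RE = re.compile(r"access\.log(?:\.(\d+))?(?:\.gz)?")


def discover_logs(log_dir: str) -> List[Path]:
    """Return rotated access logs, oldest (highest rotation number) first."""
    found = []
    for path in Path(log_dir).iterdir():
        match = LOG_NAME_RE.fullmatch(path.name)
        if match and path.is_file():
            # access.log has no number, so treat it as rotation 0 (newest).
            found.append((int(match.group(1) or 0), path))
    ordered = sorted(found, key=lambda item: item[0], reverse=True)
    return [path for _, path in ordered]


def iter_log_lines(path: Path) -> Iterator[str]:
    """Yield lines from a plain or gzip log without unpacking it to disk."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Matching on the full file name, rather than a bare access.log* glob, keeps a previously generated access.log.docviews from being read back in if it happens to sit in the same directory.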

Required arguments

You must supply:

  • --start
  • --end

Both dates must be in dd/mm/yyyy format and are inclusive.
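
The inclusive date check can be pictured with the sketch below. It assumes the nginx default timestamp layout, for example [07/May/2026:23:59:59 +0000], and compares the date exactly as logged, with no time zone conversion. The function names and the regular expression are illustrative and are not taken from gather_logs.py.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed nginx $time_local layout: [07/May/2026:23:59:59 +0000]
TIMESTAMP_RE = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}):\d{2}:\d{2}:\d{2} [+-]\d{4}\]"
)


def parse_cli_date(value: str) -> date:
    """Parse a --start or --end value given as dd/mm/yyyy."""
    return datetime.strptime(value, "%d/%m/%Y").date()


def line_date(line: str) -> Optional[date]:
    """Return the calendar date of a log line, or None if unparseable."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    try:
        # %b expects English month abbreviations (the default C locale).
        return datetime.strptime(match.group(1), "%d/%b/%Y").date()
    except ValueError:
        return None


def in_range(line: str, start: date, end: date) -> bool:
    """True when the line's date falls within start..end, both inclusive."""
    logged = line_date(line)
    return logged is not None and start <= logged <= end
```

With --start 01/05/2026 and --end 07/05/2026, a request logged at 23:59:59 on 7 May is kept and one logged at 00:00:00 on 8 May is not.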

Example command

python3 gather_logs.py \
    --start 01/05/2026 \
    --end 07/05/2026 \
    --output access.log.docviews

Optional argument

If needed, override the nginx log location with:

  • --log-dir

Default:

/var/log/nginx

Expected output

A successful run will show:

  • the log directory used
  • which files were discovered
  • counts for lines processed
  • counts excluded by date/noise filters
  • number of lines written to output

Example:

Log directory: /var/log/nginx
Found 5 log file(s) (oldest first):
  - access.log.4.gz
  - access.log.3.gz
  - access.log.2.gz
  - access.log.1
  - access.log

Date range: 01/05/2026 to 07/05/2026 (inclusive)
  Total lines read:               12,438,201
  Unparseable timestamp:                   2
  Out of date range:               9,118,304
  In range, noise excluded:        1,902,557
  In range, not docview:           1,201,820
  Kept (written to output):          215,518

Lines kept per source file:
  - access.log.4.gz: 41,203
  - access.log.3.gz: 38,891
  ...
Wrote access.log.docviews

Important: The inclusion filter checks the request path only, not the full log line. This prevents unrelated requests from being included just because the referer mentions a tracked endpoint.
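
The difference between matching on the request path and matching on the whole line is easiest to see in code. The sketch below assumes the nginx combined log format; the endpoint values in INCLUSION_ENDPOINTS are hypothetical placeholders, and the helper names are not taken from gather_logs.py.

```python
import re
from typing import Optional

# Hypothetical endpoints for illustration; the real list is in gather_logs.py.
INCLUSION_ENDPOINTS = ("/docview/", "/realtime/")

# The request is the quoted field shaped like: "GET /path?query HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/\d(?:\.\d)?"')


def request_path(line: str) -> Optional[str]:
    """Return the path part of the request target, without the query string."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return match.group(1).split("?", 1)[0]


def is_tracked(line: str) -> bool:
    """Check tracked endpoints against the request path only."""
    path = request_path(line)
    return path is not None and any(e in path for e in INCLUSION_ENDPOINTS)


viewed = ('203.0.113.5 - - [02/May/2026:10:00:00 +0000] '
          '"GET /docview/page?n=3 HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
asset = ('203.0.113.5 - - [02/May/2026:10:00:01 +0000] '
         '"GET /static/app.css HTTP/1.1" 200 90 '
         '"https://example.test/docview/page" "Mozilla/5.0"')

print(is_tracked(viewed), is_tracked(asset))
```

A plain substring test on the second line would match, because its referer contains the tracked endpoint, which is exactly the false positive the path-only check avoids.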


Step 2 — Parse Logs

Use parse_logs.py to parse the filtered log file and generate the audit workbook.

This script:

  • reads access.log.docviews
  • classifies each line
  • infers missing user IDs and workspace IDs
  • excludes admin traffic
  • resolves emails from workspace users.txt
  • generates the .xlsx report

Standard command

python3 parse_logs.py access.log.docviews report.xlsx

Optional argument

If your workspace user-map files are stored elsewhere, use:

  • --user-map-base

Default:

/mnt/data/private/_ws

Expected output

Example:

Parsed 215,518 lines, skipped 0 non-matching lines.
UID backfilled on 78,201 rows.
Workspace backfilled on 412 rows.
Excluded 9,114 rows for uids in ['1'].
Activity counts (after exclusion): Document=171,008, Realtime=35,396, Other=0
User-map base: /mnt/data/private/_ws
  Workspaces seen: 184
  Mapping entries loaded: 2,041
  (ws, uid) pairs in logs: 612
  Resolved with email: 608
  Unresolved: 4
  Unresolved (ws, uid) pairs (first 10):
    - ws=110041100, uid=89
    ...
Wrote report.xlsx

What to check: The Other count should normally be 0. If it is greater than 0, it usually means an endpoint passed the gather_logs.py inclusion filter but was not matched by any classifier in parse_logs.py. Review this before trusting the output.
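
If the report is generated from a wrapper script or a scheduled job, this check can be automated by reading the "Activity counts" line from the captured output. The sketch below assumes the line keeps the format shown in the example above; parse_activity_counts is a hypothetical helper, not part of parse_logs.py.

```python
import re
from typing import Dict


def parse_activity_counts(line: str) -> Dict[str, int]:
    """Turn 'Document=171,008, Realtime=35,396, Other=0' into integers."""
    counts = {}
    for name, number in re.findall(r"(\w+)=(\d+(?:,\d{3})*)", line):
        counts[name] = int(number.replace(",", ""))
    return counts


summary = ("Activity counts (after exclusion): "
           "Document=171,008, Realtime=35,396, Other=0")
counts = parse_activity_counts(summary)
if counts.get("Other", 0) > 0:
    print("Review unclassified endpoints before trusting the report")
```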


Understanding the Report

The workbook contains three sheets.

1. Raw Data

This is the full parsed audit trail, with one row per request.

Useful columns include:

  • DateTime
  • User ID
  • UID Source
  • Email
  • Activity Type
  • Sub-Type
  • Service
  • Workspace
  • WS Source
  • Code
  • Doc Date
  • Page (n)
  • Doc Path
  • Referer Page
  • Raw Log

Use this sheet when you need exact event-level detail.

Example use case:

  • Filter by User ID
  • Filter DateTime to a specific day
  • Review all document and realtime activity for that user

2. User Summary

This is the high-level summary sheet.

Each row represents one (Workspace, User ID) combination.

Useful columns include:

  • Workspace ID
  • User ID
  • Email
  • First Activity
  • Last Activity
  • Total Requests
  • Document Requests
  • Realtime Requests
  • Other Requests
  • Distinct Document Codes
  • Distinct IPs

Use this sheet when answering questions such as:

  • What was this user doing in a workspace?
  • Which users were active in a given workspace?
  • How much document activity occurred for a given user?
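
To make the relationship between the two sheets concrete, the sketch below shows one way the User Summary figures could be derived from Raw Data rows. The dictionary keys (workspace, uid, when, activity, code, ip) are hypothetical stand-ins for the sheet columns, and the function is an illustration, not the code used by parse_logs.py.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def summarise(rows: Iterable[Dict]) -> List[Dict]:
    """Aggregate raw request rows into one row per (workspace, uid)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["workspace"], row["uid"])].append(row)

    summary = []
    for (workspace, uid), items in sorted(groups.items()):
        times = [r["when"] for r in items]
        document = sum(r["activity"] == "Document" for r in items)
        realtime = sum(r["activity"] == "Realtime" for r in items)
        summary.append({
            "workspace": workspace,
            "uid": uid,
            "first": min(times),
            "last": max(times),
            "total": len(items),
            "document": document,
            "realtime": realtime,
            "other": len(items) - document - realtime,
            # Rows without a document code do not count as a distinct code.
            "codes": len({r["code"] for r in items if r["code"]}),
            "ips": len({r["ip"] for r in items}),
        })
    return summary
```

Because the grouping key is the (workspace, uid) pair, the same user ID appearing in two workspaces produces two summary rows.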

3. Users

This sheet shows the resolved mapping between:

  • Workspace
  • User ID
  • Email
  • Mapping status

Possible values in Status include:

  • resolved
  • no email in users.txt
  • workspace users.txt missing
  • uid not in users.txt
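
The four status values can be read as the outcome of a short decision chain. The sketch below assumes the workspace mappings have already been loaded into memory as a dictionary; that shape, and the function name resolve_user, are hypothetical, and the on-disk format of users.txt is deliberately not modelled here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory shape: workspace ID -> {uid: email}, with None
# recorded for a workspace whose users.txt could not be found.
UserMaps = Dict[str, Optional[Dict[str, str]]]


def resolve_user(user_maps: UserMaps, ws: str, uid: str) -> Tuple[Optional[str], str]:
    """Return (email, status) for one (workspace, uid) pair."""
    mapping = user_maps.get(ws)
    if mapping is None:
        return None, "workspace users.txt missing"
    if uid not in mapping:
        return None, "uid not in users.txt"
    email = mapping[uid]
    if not email:
        return None, "no email in users.txt"
    return email, "resolved"


maps: UserMaps = {
    "110027083": {"42": "user@example.test", "43": ""},
    "110041100": None,
}
print(resolve_user(maps, "110027083", "42"))
```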

Use this sheet to validate that the user attribution is correct before relying on the report.


Post-Run Validation

After generating a report, check the following first:

Check 1 — Other=0

The parse output should show:

Other=0

If Other is non-zero, review the Raw Data sheet and investigate any unclassified endpoints.

Check 2 — unresolved users should be low

A small unresolved count is normal.

A high unresolved count may indicate:

  • missing users.txt files
  • removed users
  • incorrect inference/backfill
  • workspace path issues

Check 3 — workspace backfill should be low

Workspace backfill should generally be minimal.

A spike may indicate:

  • a change in upstream URL patterns
  • additional endpoints that do not expose workspace IDs clearly

Troubleshooting

No nginx logs found

Error:

ERROR: no access.log* files found in /var/log/nginx

Possible causes:

  • wrong log directory
  • logs rotated or moved
  • logs stored off-box

Check with:

ls -la /var/log/nginx/access.log*

If needed, re-run with --log-dir.

openpyxl install fails with externally managed environment

Run:

pip3 install openpyxl --break-system-packages

Or use a virtual environment:

python3 -m venv /opt/audit/.venv
. /opt/audit/.venv/bin/activate
pip install openpyxl

users.txt missing warnings

Example:

WARNING: users.txt missing for N workspace(s)

This can be expected for:

  • deleted workspaces
  • incomplete workspace copies
  • edge cases

Check the Users sheet for affected (workspace, uid) pairs.

(unknown) rows in User Summary

This means the parser could not determine a user ID from:

  • the URL
  • session inference

Common causes:

  • pre-login traffic
  • unauthenticated requests
  • endpoints incorrectly included by gather

Review the Raw Data sheet filtered to unknown users.

Other activity type is non-zero

This means one or more paths were not classified correctly.

Review:

  • Activity Type = Other
  • Service column

Then update the parser configuration as needed.

High unparseable timestamp count

A small number is normal.

A larger count may mean:

  • a changed nginx log format
  • truncated or malformed lines

Compare sample lines to the regex definitions in gather_logs.py.


Configuration and Extension Points

Configuration is maintained directly in the Python scripts.

Add a new noise pattern

If internal monitoring or another noisy endpoint is polluting the report:

  • update EXCLUSION_SUBSTRINGS in gather_logs.py

Be specific to avoid excluding legitimate traffic.
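
The example below illustrates why specificity matters. It assumes exclusions are plain substring matches against the log line, which the name EXCLUSION_SUBSTRINGS suggests but this article does not confirm; the pattern values and the is_noise helper are hypothetical.

```python
from typing import Sequence

# Hypothetical patterns; the real values live in gather_logs.py.
EXCLUSION_SUBSTRINGS = ("GET /healthcheck ", "ELB-HealthChecker/")


def is_noise(line: str, patterns: Sequence[str] = EXCLUSION_SUBSTRINGS) -> bool:
    """True when any exclusion substring occurs in the log line."""
    return any(pattern in line for pattern in patterns)


probe = ('198.51.100.7 - - [02/May/2026:10:00:00 +0000] '
         '"GET /healthcheck HTTP/1.1" 200 2 "-" "ELB-HealthChecker/2.0"')
legit = ('203.0.113.5 - - [02/May/2026:10:00:05 +0000] '
         '"GET /docview/health-report?n=1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"')

print(is_noise(probe), is_noise(legit))
# A pattern that is too broad also removes the legitimate document view:
print(is_noise(legit, patterns=("health",)))
```

Including the HTTP method and a trailing space, or a distinctive user-agent token, keeps the pattern from matching document names that merely contain the same word.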

Add or remove a tracked endpoint

To track a new valid endpoint:

  • add it to INCLUSION_ENDPOINTS in gather_logs.py
  • add it to the correct classifier list in parse_logs.py

Relevant classifier groups include:

  • DOCUMENT_VIEW_SERVICES
  • DOCUMENT_BROWSE_SERVICES
  • REALTIME_SERVICES
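
The sketch below shows how the classifier groups might map a service name to an activity type, and why an endpoint added only to gather_logs.py ends up counted as Other. The service names, the sub-type labels, and the classify function are hypothetical; only the three group names come from parse_logs.py.

```python
from typing import Tuple

# Hypothetical members; the real lists are maintained in parse_logs.py.
DOCUMENT_VIEW_SERVICES = frozenset({"docview"})
DOCUMENT_BROWSE_SERVICES = frozenset({"doclist"})
REALTIME_SERVICES = frozenset({"realtime"})


def classify(service: str) -> Tuple[str, str]:
    """Return (activity type, sub-type) for a service name."""
    if service in DOCUMENT_VIEW_SERVICES:
        return "Document", "View"
    if service in DOCUMENT_BROWSE_SERVICES:
        return "Document", "Browse"
    if service in REALTIME_SERVICES:
        return "Realtime", ""
    # Anything gathered but not listed above falls through to Other.
    return "Other", ""


print(classify("docview"), classify("newservice"))
```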

Exclude additional user IDs

To exclude other known service or automation accounts:

  • update EXCLUDED_UIDS in parse_logs.py

Change the user-map base path

If running outside the standard platform location:

  • use --user-map-base /path/to/users

Change the session backfill window

The default session backfill window is 1800 seconds (30 minutes).

This is defined inside:

  • backfill_uids
  • backfill_workspaces

Tighten or relax this only if needed.
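
To show what the window controls, here is a simplified sketch of time-windowed backfill. It assumes a session is approximated by client IP and that only earlier requests are consulted; the real backfill_uids may use a different key or direction. The function name backfill_uids_sketch and the row keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List

WINDOW = timedelta(seconds=1800)  # 30 minutes


def backfill_uids_sketch(rows: List[Dict], window: timedelta = WINDOW) -> int:
    """Fill missing uids from the last known uid on the same IP.

    rows must be sorted by time. Each row has 'ip', 'when', and 'uid'
    (None when the URL carried no user ID). Returns the number filled.
    """
    last_seen = {}  # ip -> (uid, time that uid was last seen in a URL)
    filled = 0
    for row in rows:
        if row["uid"] is not None:
            last_seen[row["ip"]] = (row["uid"], row["when"])
            row["uid_source"] = "url"
            continue
        known = last_seen.get(row["ip"])
        if known is not None and row["when"] - known[1] <= window:
            row["uid"] = known[0]
            row["uid_source"] = "session"
            filled += 1
    return filled
```

In this sketch a backfilled row does not extend the window, so a long run of anonymous requests cannot chain an attribution indefinitely. A wider window attributes more rows but raises the risk of crediting activity to the wrong user on a shared IP.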


Re-Running the Process

Re-run both stages

Use this when:

  • changing the date range
  • gathering new logs
  • updating noise exclusions

Example:

python3 gather_logs.py --start 08/05/2026 --end 14/05/2026 --output access.log.docviews
python3 parse_logs.py access.log.docviews report.xlsx

Re-run parse only

Use this when:

  • adjusting classification
  • changing report columns
  • updating excluded user IDs
  • reissuing the report from the same gathered file

Example:

python3 parse_logs.py access.log.docviews report.xlsx

This is faster than re-running both stages, because the nginx logs are not read again, and is useful when iterating on the report.


Data Handling and Security

The generated report contains sensitive data, including:

  • email addresses
  • IP addresses
  • document codes
  • workspace identifiers

Treat the output the same way you would treat raw access logs.

Recommended handling

  • store the .xlsx on a secured audit share
  • limit access to authorised personnel only
  • follow your existing retention and clean-up policies

The intermediate file access.log.docviews is also sensitive and should be:

  • deleted after use, or
  • stored securely alongside the final report if needed for audit traceability

Network behaviour

These scripts do not perform any network I/O.

They:

  • read only from local disk
  • write only to local disk
  • do not send analytics, telemetry, or external traffic

This makes them suitable for isolated or security-sensitive environments.


Related Notes

This process is intended for audit and forensic reporting. It should not be used as a substitute for platform monitoring, user analytics, or ongoing behavioural reporting unless specifically approved.
