Generating Audit Log for User Access

Modified on Sat, 9 May at 12:48 AM

Document Access Audit — Generate a Workspace-Aware Access Report

This article explains how to generate a workspace-aware audit report from platform nginx access logs.

The report is designed to answer questions such as:

Which finalised documents did each user view on a given day?
When did a user access the realtime feed?
Which users were active in a given workspace?

This process is useful when creating forensic timelines for security, compliance, or internal investigations.

The workflow has two stages and must be run in order:

gather_logs.py — collects, decompresses, and filters the raw nginx logs
parse_logs.py — parses the filtered log and outputs a structured .xlsx report with workspace-aware user attribution

Prerequisites

Before running the scripts, confirm the following:

The scripts are being run on the platform web host
Python 3.10 or newer is installed
The openpyxl Python package is installed
You have read access to:
- /var/log/nginx
- /mnt/data/private/_ws

Install dependency

pip3 install openpyxl

If the host uses an externally managed Python environment, run:

pip3 install openpyxl --break-system-packages

Permissions required

You must have read access to:

/var/log/nginx: contains the rotated access logs
/mnt/data/private/_ws: contains the users.txt file for each workspace

On the platform host, running as root or a user with appropriate nginx log access is typically sufficient.

File Locations

This process assumes the following files are stored together in one directory, for example /opt/audit/:

/opt/audit/
├── gather_logs.py
├── parse_logs.py
└── RUNBOOK.md

Input files

The scripts use the following source files:

/var/log/nginx/access.log
/var/log/nginx/access.log.1
/var/log/nginx/access.log.2.gz
/var/log/nginx/access.log.3.gz
etc.

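They also reference workspace user mapping files in the following structure, where the two <3> components are the first three digits and the next three digits of the workspace ID: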

/mnt/data/private/_ws/<3>/<3>/<wsid>/users.txt

Example:

110027083 → /mnt/data/private/_ws/110/027/110027083/users.txt
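The path layout above can be expressed as a small helper. This is a minimal sketch based only on the example shown; the function name `users_txt_path` is hypothetical, and the handling of workspace IDs shorter than six digits is an assumption, since the article does not describe that case.

```python
from pathlib import PurePosixPath

DEFAULT_BASE = "/mnt/data/private/_ws"


def users_txt_path(wsid: str, base: str = DEFAULT_BASE) -> str:
    """Build the users.txt path for a workspace ID (hypothetical helper)."""
    wsid = str(wsid)
    # Assumption: IDs are numeric and at least six digits long, as in the example.
    if not wsid.isdigit() or len(wsid) < 6:
        raise ValueError(f"unexpected workspace ID: {wsid!r}")
    # First three digits, next three digits, then the full ID.
    return str(PurePosixPath(base) / wsid[:3] / wsid[3:6] / wsid / "users.txt")


print(users_txt_path("110027083"))
# -> /mnt/data/private/_ws/110/027/110027083/users.txt
```

Passing a different `base` mirrors what the --user-map-base option does for parse_logs.py.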

Intermediate output

Stage 1 generates:

access.log.docviews

This is the filtered log file consumed by Stage 2.

Final output

Stage 2 generates:

report.xlsx

This workbook includes three sheets:

Raw Data
User Summary
Users

Process Summary

The workflow is:

Read all access.log* files from /var/log/nginx
Decompress rotated .gz logs on the fly
Filter by date range
Exclude known noise
Keep only relevant document/realtime access entries
Write the filtered results to access.log.docviews
Parse the filtered file
Backfill missing user IDs and workspaces where possible
Resolve user emails from workspace users.txt
Output the final audit workbook as report.xlsx

Step 1 — Gather Logs

Use gather_logs.py to collect and filter the nginx access logs.

This script:

walks /var/log/nginx
decompresses .gz files where needed
filters by date range
excludes noisy or irrelevant entries
keeps only document-view or realtime-related endpoints
writes the filtered output to a single file
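For readers who want to see how the discovery and decompression steps can work, here is a minimal sketch. It is not the code in gather_logs.py; the helper names (`rotation_index`, `oldest_first`, `find_logs`, `iter_lines`) are hypothetical, and it assumes the standard logrotate naming shown in this article (access.log, access.log.1, access.log.2.gz, and so on).

```python
import gzip
import re
from pathlib import Path
from typing import Iterator, List, Optional

# Matches access.log, access.log.1, access.log.2.gz (but not access.log.docviews).
LOG_NAME = re.compile(r"access\.log(?:\.(\d+))?(?:\.gz)?")


def rotation_index(name: str) -> Optional[int]:
    """Return 0 for the live log, N for rotation N, None for other files."""
    match = LOG_NAME.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1) or 0)


def oldest_first(names: List[str]) -> List[str]:
    """Keep valid log names and order them from oldest to newest."""
    indexed = [(rotation_index(name), name) for name in names]
    valid = [(index, name) for index, name in indexed if index is not None]
    # A higher rotation number means an older file.
    return [name for index, name in sorted(valid, key=lambda item: item[0], reverse=True)]


def find_logs(log_dir: str) -> List[Path]:
    """List access logs in log_dir, oldest first."""
    names = [path.name for path in Path(log_dir).iterdir() if path.is_file()]
    return [Path(log_dir) / name for name in oldest_first(names)]


def iter_lines(path: Path) -> Iterator[str]:
    """Yield lines from a log, decompressing .gz files as they are read."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Sorting on the numeric rotation index rather than the file name avoids placing access.log.10.gz before access.log.2.gz.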

Required arguments

You must supply:

--start
--end

Both dates must be in dd/mm/yyyy format and are inclusive.
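The inclusive date check can be sketched as follows. This assumes the logs use nginx's default time_local timestamp (for example [01/May/2026:09:15:02 +0000]) and that the date is compared as written in the log line, without time zone conversion; both are assumptions, and the function names are hypothetical rather than taken from gather_logs.py.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: default nginx time_local format inside square brackets.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):\d{2}:\d{2}:\d{2} [+-]\d{4}\]")


def parse_cli_date(value: str) -> date:
    """Parse a --start or --end value in dd/mm/yyyy format."""
    return datetime.strptime(value, "%d/%m/%Y").date()


def line_date(line: str) -> Optional[date]:
    """Extract the calendar date from a log line, or None if unparseable."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    try:
        # %b expects English month abbreviations under the default C locale.
        return datetime.strptime(match.group(1), "%d/%b/%Y").date()
    except ValueError:
        return None


def in_range(line: str, start: date, end: date) -> bool:
    """True when the line's date falls within start..end, both inclusive."""
    day = line_date(line)
    return day is not None and start <= day <= end
```

Comparing whole dates rather than timestamps is what makes the end date inclusive: a request at 23:59:59 on the end date is still kept.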

Example command

python3 gather_logs.py \
    --start 01/05/2026 \
    --end 07/05/2026 \
    --output access.log.docviews

Optional argument

If needed, override the nginx log location with:

--log-dir

Default:

/var/log/nginx

Expected output

A successful run will show:

the log directory used
which files were discovered
counts for lines processed
counts excluded by date/noise filters
number of lines written to output

Example:

Log directory: /var/log/nginx
Found 5 log file(s) (oldest first):
  - access.log.4.gz
  - access.log.3.gz
  - access.log.2.gz
  - access.log.1
  - access.log

Date range: 01/05/2026 to 07/05/2026 (inclusive)
  Total lines read:               12,438,201
  Unparseable timestamp:                   2
  Out of date range:               9,118,304
  In range, noise excluded:        1,902,557
  In range, not docview:           1,201,820
  Kept (written to output):          215,518

Lines kept per source file:
  - access.log.4.gz: 41,203
  - access.log.3.gz: 38,891
  ...
Wrote access.log.docviews

Important: The inclusion filter checks the request path only, not the full log line. This prevents unrelated requests from being included just because the referer mentions a tracked endpoint.
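To illustrate why matching on the request path matters, here is a minimal sketch. The endpoint values are hypothetical placeholders (the real list is INCLUSION_ENDPOINTS in gather_logs.py), and it assumes the request is logged as "METHOD PATH PROTOCOL" in the first quoted field, as in nginx's combined format. Dropping the query string before matching is also an assumption.

```python
import re
from typing import Optional

# Hypothetical values for illustration only.
INCLUSION_ENDPOINTS = ("/docview/", "/realtime/")

# Assumption: the request field looks like "GET /some/path HTTP/1.1".
REQUEST = re.compile(r'"(?:[A-Z]+) (\S+) HTTP/[0-9.]+"')


def request_path(line: str) -> Optional[str]:
    """Return the request path without its query string, or None."""
    match = REQUEST.search(line)
    if match is None:
        return None
    return match.group(1).split("?", 1)[0]


def is_tracked(line: str) -> bool:
    """Match tracked endpoints against the request path only."""
    path = request_path(line)
    return path is not None and any(endpoint in path for endpoint in INCLUSION_ENDPOINTS)


tracked = ('10.0.0.1 - - [01/May/2026:09:00:00 +0000] '
           '"GET /docview/abc HTTP/1.1" 200 512 "https://example.test/home" "UA"')
referer_only = ('10.0.0.1 - - [01/May/2026:09:00:01 +0000] '
                '"GET /static/app.css HTTP/1.1" 200 512 "https://example.test/docview/abc" "UA"')
```

In this sketch `is_tracked(tracked)` is true, while `is_tracked(referer_only)` is false even though the referer contains a tracked endpoint. A plain substring test on the whole line would have kept both.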

Step 2 — Parse Logs

Use parse_logs.py to parse the filtered log file and generate the audit workbook.

This script:

reads access.log.docviews
classifies each line
infers missing user IDs and workspace IDs
excludes admin traffic
resolves emails from workspace users.txt
generates the .xlsx report

Standard command

python3 parse_logs.py access.log.docviews report.xlsx

Optional argument

If your workspace user-map files are stored elsewhere, use:

--user-map-base

Default:

/mnt/data/private/_ws

Expected output

Example:

Parsed 215,518 lines, skipped 0 non-matching lines.
UID backfilled on 78,201 rows.
Workspace backfilled on 412 rows.
Excluded 9,114 rows for uids in ['1'].
Activity counts (after exclusion): Document=171,008, Realtime=35,396, Other=0
User-map base: /mnt/data/private/_ws
  Workspaces seen: 184
  Mapping entries loaded: 2,041
  (ws, uid) pairs in logs: 612
  Resolved with email: 608
  Unresolved: 4
  Unresolved (ws, uid) pairs (first 10):
    - ws=110041100, uid=89
    ...
Wrote report.xlsx

What to check: The Other count should normally be 0. If it is greater than 0, it usually means an endpoint passed through gather but was not classified correctly in parse. Review this before trusting the output.
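If the parse output is captured to a file or variable, the Other count can be checked automatically. This is a minimal sketch that assumes the summary line has the exact shape shown in the example output above; `activity_counts` is a hypothetical helper, not part of parse_logs.py.

```python
import re
from typing import Dict


def activity_counts(summary_line: str) -> Dict[str, int]:
    """Parse 'Name=1,234' pairs from the activity summary line."""
    pairs = re.findall(r"(\w+)=(\d+(?:,\d{3})*)", summary_line)
    return {name: int(value.replace(",", "")) for name, value in pairs}


line = "Activity counts (after exclusion): Document=171,008, Realtime=35,396, Other=0"
counts = activity_counts(line)

if counts.get("Other", 0) != 0:
    print("Review unclassified endpoints before trusting the report")
```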

Understanding the Report

The workbook contains three sheets.

1. Raw Data

This is the full parsed audit trail, with one row per request.

Useful columns include:

DateTime
User ID
UID Source
Email
Activity Type
Sub-Type
Service
Workspace
WS Source
Code
Doc Date
Page (n)
Doc Path
Referer Page
Raw Log

Use this sheet when you need exact event-level detail.

Example use case:

Filter by User ID
Filter DateTime to a specific day
Review all document and realtime activity for that user

2. User Summary

This is the high-level summary sheet.

Each row represents one (Workspace, User ID) combination.

Useful columns include:

Workspace ID
User ID
Email
First Activity
Last Activity
Total Requests
Document Requests
Realtime Requests
Other Requests
Distinct Document Codes
Distinct IPs

Use this sheet when answering questions such as:

What was this user doing in a workspace?
Which users were active in a given matter?
How much document activity occurred for a given user?
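The grouping behind this sheet can be sketched in plain Python. The row keys used here (`workspace`, `uid`, `when`, `activity`, `code`, `ip`) are hypothetical stand-ins for the Raw Data columns, and the function is an illustration of the aggregation, not the code in parse_logs.py.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def summarise(rows: List[Dict[str, Any]]) -> Dict[Tuple[str, str], Dict[str, Any]]:
    """Aggregate request rows into one summary per (workspace, uid)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["workspace"], row["uid"])].append(row)

    summary = {}
    for key, items in groups.items():
        times = [item["when"] for item in items]
        summary[key] = {
            "first_activity": min(times),
            "last_activity": max(times),
            "total_requests": len(items),
            "document_requests": sum(1 for i in items if i["activity"] == "Document"),
            "realtime_requests": sum(1 for i in items if i["activity"] == "Realtime"),
            # Rows without a document code do not count towards distinct codes.
            "distinct_codes": len({i["code"] for i in items if i["code"]}),
            "distinct_ips": len({i["ip"] for i in items}),
        }
    return summary
```

Keying on the (workspace, uid) pair rather than the uid alone matters because the same numeric user ID can appear in more than one workspace's users.txt.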

3. Users

This sheet shows the resolved mapping between:

Workspace
User ID
Email
Mapping status

Possible values in Status include:

resolved
no email in users.txt
workspace users.txt missing
uid not in users.txt

Use this sheet to validate that the user attribution is correct before relying on the report.
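The four status values can be read as a simple decision order. The sketch below assumes each workspace's users.txt has already been loaded into a dictionary of user ID to email, or None when the file is missing; the file format itself is not described in this article, so loading is left out, and `mapping_status` is a hypothetical name.

```python
from typing import Dict, Optional


def mapping_status(ws_users: Optional[Dict[str, str]], uid: str) -> str:
    """Return the Status value for one (workspace, uid) pair."""
    if ws_users is None:
        # The workspace has no users.txt on disk.
        return "workspace users.txt missing"
    if uid not in ws_users:
        return "uid not in users.txt"
    if not ws_users[uid]:
        # The user is listed, but without an email address.
        return "no email in users.txt"
    return "resolved"
```

For example, `mapping_status({"42": "user@example.test"}, "89")` returns "uid not in users.txt", which is the kind of pair listed under Unresolved in the parse output.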

Post-Run Validation

After generating a report, check the following first:

Check 1 — `Other=0`

The parse output should show:

Other=0

If Other is non-zero, review the Raw Data sheet and investigate any unclassified endpoints.

Check 2 — unresolved users should be low

A small unresolved count is normal.

A high unresolved count may indicate:

missing users.txt files
removed users
incorrect inference/backfill
workspace path issues

Check 3 — workspace backfill should be low

Workspace backfill should generally be minimal.

A spike may indicate:

a change in upstream URL patterns
additional endpoints that do not expose workspace IDs clearly

Troubleshooting

No nginx logs found

Error:

ERROR: no access.log* files found in /var/log/nginx

Possible causes:

wrong log directory
logs rotated or moved
logs stored off-box

Check with:

ls -la /var/log/nginx/access.log*

If needed, re-run with --log-dir.

`openpyxl` install fails with externally managed environment

Run:

pip3 install openpyxl --break-system-packages

Or use a virtual environment:

python3 -m venv /opt/audit/.venv
. /opt/audit/.venv/bin/activate
pip install openpyxl

`users.txt` missing warnings

Example:

WARNING: users.txt missing for N workspace(s)

This can be expected for:

deleted workspaces
incomplete workspace copies
edge cases

Check the Users sheet for affected (workspace, uid) pairs.

`(unknown)` rows in User Summary

This means the parser could not determine a user ID from:

the URL
session inference

Common causes:

pre-login traffic
unauthenticated requests
endpoints incorrectly included by gather

Review the Raw Data sheet filtered to unknown users.

`Other` activity type is non-zero

This means one or more paths were not classified correctly.

Review:

Activity Type = Other
Service column

Then add the unclassified service to the appropriate classifier list in parse_logs.py (see Add or remove a tracked endpoint).

High unparseable timestamp count

A small number is normal.

A larger count may mean:

a changed nginx log format
truncated or malformed lines

Compare sample lines to the regex definitions in gather_logs.py.

Configuration and Extension Points

Configuration is maintained directly in the Python scripts.

Add a new noise pattern

If internal monitoring or another noisy endpoint is polluting the report:

update EXCLUSION_SUBSTRINGS in gather_logs.py

Be specific to avoid excluding legitimate traffic.
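The following sketch shows why specificity matters. The pattern values are hypothetical (the real list is EXCLUSION_SUBSTRINGS in gather_logs.py), and it assumes exclusions are plain substring tests against the whole log line, which this article does not state explicitly.

```python
from typing import Sequence

# Hypothetical patterns for illustration only.
EXCLUSION_SUBSTRINGS = ("GET /healthcheck ", "Pingdom")


def is_noise(line: str, patterns: Sequence[str] = EXCLUSION_SUBSTRINGS) -> bool:
    """True when any exclusion substring appears anywhere in the line."""
    return any(pattern in line for pattern in patterns)


monitor = '10.0.0.9 - - [01/May/2026:09:00:00 +0000] "GET /healthcheck HTTP/1.1" 200 2 "-" "probe"'
legit = '10.0.0.1 - - [01/May/2026:09:00:05 +0000] "GET /docview/health-report HTTP/1.1" 200 512 "-" "UA"'
```

With the specific patterns above, only the monitoring request is treated as noise. A broad pattern such as "health" would also exclude the legitimate document view, silently removing it from the audit trail.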

Add or remove a tracked endpoint

To track a new valid endpoint:

add it to INCLUSION_ENDPOINTS in gather_logs.py
add it to the correct classifier list in parse_logs.py

Relevant classifier groups include:

DOCUMENT_VIEW_SERVICES
DOCUMENT_BROWSE_SERVICES
REALTIME_SERVICES

Exclude additional user IDs

To exclude other known service or automation accounts:

update EXCLUDED_UIDS in parse_logs.py
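As an illustration of what the exclusion does, here is a minimal sketch. It assumes user IDs are held as strings, as the sample output "uids in ['1']" suggests; the row shape and the helper name `drop_excluded` are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumption: string IDs, matching the sample output shown earlier.
EXCLUDED_UIDS = {"1"}


def drop_excluded(
    rows: List[Dict[str, Any]], excluded: Iterable[str] = EXCLUDED_UIDS
) -> Tuple[List[Dict[str, Any]], int]:
    """Return the rows to keep and the number of rows removed."""
    excluded = set(excluded)
    kept = [row for row in rows if row["uid"] not in excluded]
    return kept, len(rows) - len(kept)
```

Returning the removed count alongside the kept rows makes it easy to report how much traffic was excluded, as the parse output does.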

Change the user-map base path

If running outside the standard platform location:

use --user-map-base /path/to/users

Change the session backfill window

The default session backfill window is 1800 seconds (30 minutes).

This is defined inside:

backfill_uids
backfill_workspaces

Tighten or relax this only if needed.
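To show what the window controls, here is a simplified sketch of session-based backfill. It is not the logic in parse_logs.py: it assumes a session is identified by client IP, that only earlier identified requests are used, and that the row keys (`ip`, `uid`, `when`, `uid_source`) and source labels are hypothetical.

```python
from datetime import timedelta
from typing import Any, Dict, List

WINDOW = timedelta(seconds=1800)  # 30 minutes


def backfill_uids_sketch(rows: List[Dict[str, Any]], window: timedelta = WINDOW) -> List[Dict[str, Any]]:
    """Fill a missing uid from the latest identified request on the same IP."""
    last_seen = {}  # ip -> (uid, timestamp of the identified request)
    for row in sorted(rows, key=lambda r: r["when"]):
        if row["uid"]:
            last_seen[row["ip"]] = (row["uid"], row["when"])
            row["uid_source"] = "url"
            continue
        previous = last_seen.get(row["ip"])
        # Backfilled rows do not extend the window; only identified ones do.
        if previous is not None and row["when"] - previous[1] <= window:
            row["uid"] = previous[0]
            row["uid_source"] = "backfill"
    return rows
```

A wider window attributes more anonymous requests to a user but raises the risk of misattribution, for example when several users share one IP address. A narrower window leaves more rows as (unknown).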

Re-Running the Process

Re-run both stages

Use this when:

changing the date range
gathering new logs
updating noise exclusions

Example:

python3 gather_logs.py --start 08/05/2026 --end 14/05/2026 --output access.log.docviews
python3 parse_logs.py access.log.docviews report.xlsx

Re-run parse only

Use this when:

adjusting classification
changing report columns
updating excluded user IDs
reissuing the report from the same gathered file

Example:

python3 parse_logs.py access.log.docviews report.xlsx

This is faster than re-running both stages, because the logs are not gathered again, and is useful when iterating on the report.

Data Handling and Security

The generated report contains sensitive data, including:

email addresses
IP addresses
document codes
workspace identifiers

Treat the output the same way you would treat raw access logs.

Recommended handling

store the .xlsx on a secured audit share
limit access to authorised personnel only
follow your existing retention and clean-up policies

The intermediate file access.log.docviews is also sensitive and should be:

deleted after use, or
stored securely alongside the final report if needed for audit traceability

Network behaviour

These scripts do not perform any network I/O.

They:

read only from local disk
write only to local disk
do not send analytics, telemetry, or external traffic

This makes them suitable for isolated or security-sensitive environments.

Related Notes

This process is intended for audit and forensic reporting. It should not be used as a substitute for platform monitoring, user analytics, or ongoing behavioural reporting unless specifically approved.

Attachments (2)

gather_logs.py
8.94 KB

parse_logs.py
20.4 KB
