Document Access Audit — Generate a Workspace-Aware Access Report
This article explains how to generate a workspace-aware audit report from platform nginx access logs.
The report is designed to answer questions such as:
- Which finalised documents did each user view on a given day?
- When did a user access the realtime feed?
- Which users were active in a given workspace?
This process is useful when creating forensic timelines for security, compliance, or internal investigations.
The workflow has two stages and must be run in order:
1. gather_logs.py: collects, decompresses, and filters the raw nginx logs
2. parse_logs.py: parses the filtered log and outputs a structured .xlsx report with workspace-aware user attribution
Prerequisites
Before running the scripts, confirm the following:
- The scripts are being run on the platform web host
- Python 3.10 or newer is installed
- The openpyxl Python package is installed
- You have read access to:
  - /var/log/nginx
  - /mnt/data/private/_ws
Install dependency
pip3 install openpyxl
If the host uses an externally managed Python environment, run:
pip3 install openpyxl --break-system-packages
Permissions required
You must have read access to:
- /var/log/nginx: contains the rotated access logs
- /mnt/data/private/_ws: contains each workspace users.txt file
On the platform host, running as root or a user with appropriate nginx log access is typically sufficient.
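The prerequisites above can be checked before a run with a short preflight script. This is a minimal sketch, not part of the shipped tooling: the function name preflight and its parameters are illustrative, and the two directory paths are the defaults named in this article.

```python
import importlib.util
import os
import sys

# Default locations named in this article; adjust if your host differs.
REQUIRED_DIRS = ["/var/log/nginx", "/mnt/data/private/_ws"]


def preflight(dirs=REQUIRED_DIRS, min_python=(3, 10), package="openpyxl"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec(package) is None:
        problems.append(f"Python package '{package}' is not installed")
    for d in dirs:
        if not os.path.isdir(d):
            problems.append(f"{d} does not exist or is not a directory")
        elif not os.access(d, os.R_OK | os.X_OK):
            problems.append(f"{d} is not readable by the current user")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("PROBLEM:", issue)
    print("Ready." if not issues else f"{len(issues)} problem(s) found.")
```

The check only confirms that the directories can be listed. Individual rotated log files may still carry tighter permissions.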
File Locations
This process assumes the following files are stored together in one directory, for example /opt/audit/:
/opt/audit/
├── gather_logs.py
├── parse_logs.py
└── RUNBOOK.md
Input files
The scripts use the following source files:
- /var/log/nginx/access.log
- /var/log/nginx/access.log.1
- /var/log/nginx/access.log.2.gz
- /var/log/nginx/access.log.3.gz
- etc.
They also reference workspace user mapping files in the following structure:
/mnt/data/private/_ws/<3>/<3>/<wsid>/users.txt
Here the two <3> segments are the first three and the next three digits of the workspace ID.
Example:
110027083 → /mnt/data/private/_ws/110/027/110027083/users.txt
Intermediate output
Stage 1 generates:
access.log.docviews
This is the filtered log file consumed by Stage 2.
Final output
Stage 2 generates:
report.xlsx
This workbook includes three sheets:
- Raw Data
- User Summary
- Users
Process Summary
The workflow is:
- Read all access.log* files from /var/log/nginx
- Decompress rotated .gz logs on the fly
- Filter by date range
- Exclude known noise
- Keep only relevant document/realtime access entries
- Write the filtered results to access.log.docviews
- Parse the filtered file
- Backfill missing user IDs and workspaces where possible
- Resolve user emails from workspace users.txt
- Output the final audit workbook as report.xlsx
Step 1 — Gather Logs
Use gather_logs.py to collect and filter the nginx access logs.
This script:
- walks /var/log/nginx
- decompresses .gz files where needed
- filters by date range
- excludes noisy or irrelevant entries
- keeps only document-view or realtime-related endpoints
- writes the filtered output to a single file
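The discovery and decompression steps listed above can be sketched as follows. This is an illustration of the technique, not the code in gather_logs.py: the function names find_logs and read_lines are hypothetical, and the sketch assumes the standard logrotate naming (access.log, access.log.1, access.log.2.gz, and so on).

```python
import gzip
import re
from pathlib import Path

# Matches access.log, access.log.1, access.log.2.gz, ...
_ROTATION = re.compile(r"^access\.log(?:\.(\d+))?(?:\.gz)?$")


def find_logs(log_dir):
    """Return access.log* files sorted oldest first."""
    found = []
    for path in Path(log_dir).iterdir():
        m = _ROTATION.match(path.name)
        if m and path.is_file():
            # access.log has rotation 0 (newest); larger numbers are older.
            found.append((int(m.group(1) or 0), path))
    found.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in found]


def read_lines(path):
    """Yield text lines, decompressing .gz files on the fly."""
    opener = gzip.open if path.suffix == ".gz" else open
    # errors="replace" keeps a single bad byte from aborting a long run.
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Reading with gzip.open in text mode streams each file, so rotated logs never need to be unpacked to disk first.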
Required arguments
You must supply:
- --start
- --end
Both dates must be in dd/mm/yyyy format and are inclusive.
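A minimal sketch of an inclusive date check is shown here. It assumes the logs use nginx's default time_local timestamp (for example 07/May/2026:23:59:59 +0000); the helper names are hypothetical and the real parsing in gather_logs.py may differ.

```python
from datetime import datetime


def parse_cli_date(text):
    """Parse a dd/mm/yyyy argument such as 01/05/2026."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def log_date(timestamp):
    """Parse an nginx time_local value such as 07/May/2026:23:59:59 +0000."""
    # The calendar date is taken in the log's own UTC offset (no conversion).
    return datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").date()


def in_range(timestamp, start, end):
    """True when the log timestamp falls on or between start and end."""
    return start <= log_date(timestamp) <= end


start = parse_cli_date("01/05/2026")
end = parse_cli_date("07/05/2026")
print(in_range("07/May/2026:23:59:59 +0000", start, end))  # last second of the end date
print(in_range("08/May/2026:00:00:00 +0000", start, end))  # first second after the range
```

Comparing calendar dates rather than datetimes is what makes the end date inclusive: every request logged on the end date is kept, whatever its time of day.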
Example command
python3 gather_logs.py \
--start 01/05/2026 \
--end 07/05/2026 \
--output access.log.docviews
Optional argument
If needed, override the nginx log location with:
--log-dir
Default:
/var/log/nginx
Expected output
A successful run will show:
- the log directory used
- which files were discovered
- counts for lines processed
- counts excluded by date/noise filters
- number of lines written to output
Example:
Log directory: /var/log/nginx
Found 5 log file(s) (oldest first):
- access.log.4.gz
- access.log.3.gz
- access.log.2.gz
- access.log.1
- access.log
Date range: 01/05/2026 to 07/05/2026 (inclusive)
Total lines read: 12,438,201
Unparseable timestamp: 2
Out of date range: 9,118,304
In range, noise excluded: 1,902,557
In range, not docview: 1,201,820
Kept (written to output): 215,518
Lines kept per source file:
- access.log.4.gz: 41,203
- access.log.3.gz: 38,891
...
Wrote access.log.docviews
Important: The inclusion filter checks the request path only, not the full log line. This prevents unrelated requests from being included just because the referer mentions a tracked endpoint.
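The difference between matching the request path and matching the whole line can be illustrated as follows. The endpoint names, the regex, and the sample lines are assumptions made for this sketch; the real values live in INCLUSION_ENDPOINTS in gather_logs.py, and the script may match paths differently.

```python
import re

# Hypothetical endpoints for illustration only.
INCLUSION_ENDPOINTS = ("/docview/", "/realtime/")

# Captures the path from the quoted request field: "GET /path HTTP/1.1"
_REQUEST = re.compile(r'"(?:[A-Z]+) (\S+) HTTP/[\d.]+"')


def request_path(line):
    """Return the request path from a combined-format log line, or None."""
    m = _REQUEST.search(line)
    return m.group(1) if m else None


def is_tracked(line):
    """True only when the request path itself contains a tracked endpoint."""
    path = request_path(line)
    return path is not None and any(e in path for e in INCLUSION_ENDPOINTS)


hit = ('10.0.0.1 - - [01/May/2026:10:00:00 +0000] '
       '"GET /docview/page?code=ABC HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
# A stylesheet request whose referer happens to mention /docview/.
miss = ('10.0.0.1 - - [01/May/2026:10:00:01 +0000] '
        '"GET /static/app.css HTTP/1.1" 200 128 '
        '"https://example.test/docview/page?code=ABC" "Mozilla/5.0"')

print(is_tracked(hit))   # True
print(is_tracked(miss))  # False, although the full line contains /docview/
```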
Step 2 — Parse Logs
Use parse_logs.py to parse the filtered log file and generate the audit workbook.
This script:
- reads access.log.docviews
- classifies each line
- infers missing user IDs and workspace IDs
- excludes admin traffic
- resolves emails from workspace users.txt
- generates the .xlsx report
Standard command
python3 parse_logs.py access.log.docviews report.xlsx
Optional argument
If your workspace user-map files are stored elsewhere, use:
--user-map-base
Default:
/mnt/data/private/_ws
Expected output
Example:
Parsed 215,518 lines, skipped 0 non-matching lines.
UID backfilled on 78,201 rows.
Workspace backfilled on 412 rows.
Excluded 9,114 rows for uids in ['1'].
Activity counts (after exclusion): Document=171,008, Realtime=35,396, Other=0
User-map base: /mnt/data/private/_ws
Workspaces seen: 184
Mapping entries loaded: 2,041
(ws, uid) pairs in logs: 612
Resolved with email: 608
Unresolved: 4
Unresolved (ws, uid) pairs (first 10):
- ws=110041100, uid=89
...
Wrote report.xlsx
What to check: The Other count should normally be 0. If it is greater than 0, it usually means an endpoint passed through gather but was not classified correctly in parse. Review this before trusting the output.
Understanding the Report
The workbook contains three sheets.
1. Raw Data
This is the full parsed audit trail, with one row per request.
Useful columns include:
- DateTime
- User ID
- UID Source
- Email
- Activity Type
- Sub-Type
- Service
- Workspace
- WS Source
- Code
- Doc Date
- Page (n)
- Doc Path
- Referer Page
- Raw Log
Use this sheet when you need exact event-level detail.
Example use case:
- Filter by User ID
- Filter DateTime to a specific day
- Review all document and realtime activity for that user
2. User Summary
This is the high-level summary sheet.
Each row represents one (Workspace, User ID) combination.
Useful columns include:
- Workspace ID
- User ID
- Email
- First Activity
- Last Activity
- Total Requests
- Document Requests
- Realtime Requests
- Other Requests
- Distinct Document Codes
- Distinct IPs
Use this sheet when answering questions such as:
- What was this user doing in a workspace?
- Which users were active in a given matter?
- How much document activity occurred for a given user?
3. Users
This sheet shows the resolved mapping between:
- Workspace
- User ID
- Mapping status
Possible values in Status include:
- resolved
- no email in users.txt
- workspace users.txt missing
- uid not in users.txt
Use this sheet to validate that the user attribution is correct before relying on the report.
Post-Run Validation
After generating a report, check the following first:
Check 1 — Other=0
The parse output should show:
Other=0
If Other is non-zero, review the Raw Data sheet and investigate any unclassified endpoints.
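This check can be automated by reading the summary that parse_logs.py prints. The sketch below assumes the summary line has the shape shown in the Step 2 example output and that you have captured the script's output as text; the function name activity_counts is hypothetical.

```python
import re

# Matches pairs such as Document=171,008 or Other=0.
_COUNTS = re.compile(r"(\w+)=(\d+(?:,\d{3})*)")


def activity_counts(output):
    """Extract activity counts from captured parse_logs.py output."""
    for line in output.splitlines():
        if line.startswith("Activity counts"):
            return {
                name: int(value.replace(",", ""))
                for name, value in _COUNTS.findall(line)
            }
    return {}  # summary line not found


sample = (
    "Parsed 215,518 lines, skipped 0 non-matching lines.\n"
    "Activity counts (after exclusion): "
    "Document=171,008, Realtime=35,396, Other=0\n"
    "Wrote report.xlsx"
)
counts = activity_counts(sample)
print(counts)  # {'Document': 171008, 'Realtime': 35396, 'Other': 0}
if counts.get("Other", 0) > 0:
    print("Unclassified endpoints present: review Raw Data before use.")
```

An empty result means the summary line was not found, which is itself worth investigating before trusting the report.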
Check 2 — unresolved users should be low
A small unresolved count is normal.
A high unresolved count may indicate:
- missing users.txt files
- removed users
- incorrect inference/backfill
- workspace path issues
Check 3 — workspace backfill should be low
Workspace backfill should generally be minimal.
A spike may indicate:
- a change in upstream URL patterns
- additional endpoints that do not expose workspace IDs clearly
Troubleshooting
No nginx logs found
Error:
ERROR: no access.log* files found in /var/log/nginx
Possible causes:
- wrong log directory
- logs rotated or moved
- logs stored off-box
Check with:
ls -la /var/log/nginx/access.log*
If needed, re-run with --log-dir.
openpyxl install fails with externally managed environment
Run:
pip3 install openpyxl --break-system-packages
Or use a virtual environment:
python3 -m venv /opt/audit/.venv
. /opt/audit/.venv/bin/activate
pip install openpyxl
users.txt missing warnings
Example:
WARNING: users.txt missing for N workspace(s)
This can be expected for:
- deleted workspaces
- incomplete workspace copies
- edge cases
Check the Users sheet for affected (workspace, uid) pairs.
(unknown) rows in User Summary
This means the parser could not determine a user ID from:
- the URL
- session inference
Common causes:
- pre-login traffic
- unauthenticated requests
- endpoints incorrectly included by gather
Review the Raw Data sheet filtered to unknown users.
Other activity type is non-zero
This means one or more paths were not classified correctly.
Review:
- Activity Type = Other
- Service column
Then update the parser configuration as needed.
High unparseable timestamp count
A small number is normal.
A larger count may mean:
- a changed nginx log format
- truncated or malformed lines
Compare sample lines to the regex definitions in gather_logs.py.
Configuration and Extension Points
Configuration is maintained directly in the Python scripts.
Add a new noise pattern
If internal monitoring or another noisy endpoint is polluting the report:
- update EXCLUSION_SUBSTRINGS in gather_logs.py
Be specific to avoid excluding legitimate traffic.
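The following sketch shows why a short pattern can remove legitimate traffic. It assumes exclusions are plain substring tests against the log line, as the name EXCLUSION_SUBSTRINGS suggests; the patterns and sample lines are hypothetical.

```python
# Hypothetical patterns for illustration; the real list is
# EXCLUSION_SUBSTRINGS in gather_logs.py.
TOO_BROAD = ("/health",)
SPECIFIC = ("GET /healthcheck/ping ",)


def is_noise(line, patterns):
    """True when any exclusion substring occurs in the line."""
    return any(p in line for p in patterns)


monitor = '10.0.0.9 - - [...] "GET /healthcheck/ping HTTP/1.1" 200 2'
legit = '10.0.0.1 - - [...] "GET /docview/health-records/page HTTP/1.1" 200 512'

print(is_noise(monitor, TOO_BROAD), is_noise(legit, TOO_BROAD))  # True True
print(is_noise(monitor, SPECIFIC), is_noise(legit, SPECIFIC))    # True False
```

Including the HTTP method and the trailing space anchors the pattern to one exact request, so a document path that merely contains the same word is left alone.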
Add or remove a tracked endpoint
To track a new valid endpoint:
- add it to INCLUSION_ENDPOINTS in gather_logs.py
- add it to the correct classifier list in parse_logs.py
Relevant classifier groups include:
- DOCUMENT_VIEW_SERVICES
- DOCUMENT_BROWSE_SERVICES
- REALTIME_SERVICES
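The relationship between these groups and the Other count can be sketched as a simple lookup. The three group names come from parse_logs.py, but their members, the sub-type labels, and the classify function are assumptions for illustration.

```python
# Group names are from parse_logs.py; the members are hypothetical.
DOCUMENT_VIEW_SERVICES = {"docview"}
DOCUMENT_BROWSE_SERVICES = {"doclist"}
REALTIME_SERVICES = {"realtime"}


def classify(service):
    """Return (activity_type, sub_type) for a service name."""
    if service in DOCUMENT_VIEW_SERVICES:
        return ("Document", "View")
    if service in DOCUMENT_BROWSE_SERVICES:
        return ("Document", "Browse")
    if service in REALTIME_SERVICES:
        return ("Realtime", "")
    # Anything gather let through that no group claims ends up here.
    return ("Other", "")


print(classify("docview"))    # ('Document', 'View')
print(classify("newviewer"))  # ('Other', '')
```

Under this model, an endpoint added to INCLUSION_ENDPOINTS but to none of the classifier groups surfaces as Other, which is why both files need updating together.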
Exclude additional user IDs
To exclude other known service or automation accounts:
- update EXCLUDED_UIDS in parse_logs.py
Change the user-map base path
If running outside the standard platform location:
- use --user-map-base /path/to/users
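To see which users.txt file a given workspace ID maps to under a custom base, the layout /mnt/data/private/_ws/<3>/<3>/<wsid>/users.txt from the Input files section can be expressed as a small helper. The function name users_txt_path is hypothetical, and the sketch assumes workspace IDs are at least six characters long, as in the documented example.

```python
from pathlib import PurePosixPath

DEFAULT_BASE = "/mnt/data/private/_ws"


def users_txt_path(wsid, base=DEFAULT_BASE):
    """Build the users.txt path for a workspace ID under the given base."""
    wsid = str(wsid)
    if len(wsid) < 6:
        raise ValueError(f"workspace ID too short to shard: {wsid!r}")
    # First three characters, next three characters, then the full ID.
    return PurePosixPath(base) / wsid[:3] / wsid[3:6] / wsid / "users.txt"


print(users_txt_path(110027083))
# /mnt/data/private/_ws/110/027/110027083/users.txt
print(users_txt_path(110027083, base="/path/to/users"))
# /path/to/users/110/027/110027083/users.txt
```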
Change the session backfill window
The default session backfill window is:
1800 seconds (30 minutes)
This is defined inside:
- backfill_uids
- backfill_workspaces
Tighten or relax this only if needed.
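The effect of the window can be illustrated with a simplified backfill. This is not the logic in parse_logs.py: it assumes requests are grouped into sessions by client IP, which this article does not confirm, and the row keys (ts, ip, uid, uid_source) are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(seconds=1800)  # 30 minutes, the documented default


def backfill_uids(rows, window=WINDOW):
    """Fill missing 'uid' values in place; return how many rows were filled.

    Each row is a dict with 'ts' (datetime), 'ip' (str), 'uid' (str or None).
    ASSUMPTION: a session is identified by client IP.
    """
    known = defaultdict(list)  # ip -> sorted list of (ts, uid)
    for row in rows:
        if row["uid"] is not None:
            known[row["ip"]].append((row["ts"], row["uid"]))
    for seen in known.values():
        seen.sort()

    filled = 0
    for row in rows:
        if row["uid"] is not None:
            continue
        seen = known.get(row["ip"], [])
        i = bisect_left(seen, (row["ts"],))
        # Candidates: the identified request just before and just after.
        nearby = seen[max(i - 1, 0):i + 1]
        best = min(nearby, key=lambda item: abs(item[0] - row["ts"]), default=None)
        if best is not None and abs(best[0] - row["ts"]) <= window:
            row["uid"] = best[1]
            row["uid_source"] = "inferred"  # keep inferred rows distinguishable
            filled += 1
    return filled
```

In this model a wider window attributes more anonymous requests but raises the risk of crediting them to the wrong user on a shared address, while a narrower window leaves more rows as unknown.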
Re-Running the Process
Re-run both stages
Use this when:
- changing the date range
- gathering new logs
- updating noise exclusions
Example:
python3 gather_logs.py --start 08/05/2026 --end 14/05/2026 --output access.log.docviews
python3 parse_logs.py access.log.docviews report.xlsx
Re-run parse only
Use this when:
- adjusting classification
- changing report columns
- updating excluded user IDs
- reissuing the report from the same gathered file
Example:
python3 parse_logs.py access.log.docviews report.xlsx
This is faster and useful during report iteration.
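Both re-run modes can be wrapped in a small helper so the stages always run in the right order. This is a sketch rather than shipped tooling: it uses only the command-line arguments documented above, the function names are hypothetical, and it assumes each script exits with a non-zero status on failure.

```python
import subprocess
import sys


def build_commands(start=None, end=None,
                   filtered="access.log.docviews", report="report.xlsx"):
    """Return the command lines to run.

    The gather stage is included only when both dates are supplied;
    otherwise only the parse stage runs against the existing filtered file.
    """
    commands = []
    if start and end:
        commands.append([sys.executable, "gather_logs.py",
                         "--start", start, "--end", end,
                         "--output", filtered])
    commands.append([sys.executable, "parse_logs.py", filtered, report])
    return commands


def run(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so parse never runs against a stale or partial gather output.
        subprocess.run(cmd, check=True)


# Full re-run:  run(build_commands("08/05/2026", "14/05/2026"))
# Parse only:   run(build_commands())
```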
Data Handling and Security
The generated report contains sensitive data, including:
- email addresses
- IP addresses
- document codes
- workspace identifiers
Treat the output the same way you would treat raw access logs.
Recommended handling
- store the .xlsx on a secured audit share
- limit access to authorised personnel only
- follow your existing retention and clean-up policies
The intermediate file access.log.docviews is also sensitive and should be:
- deleted after use, or
- stored securely alongside the final report if needed for audit traceability
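On a POSIX host, the local copies can be restricted or removed with a few lines of Python. This is a minimal sketch with hypothetical helper names; it does not replace your retention policy, and deleting a file this way does not securely wipe the underlying disk blocks.

```python
import os
from pathlib import Path


def lock_down(path):
    """Restrict a file to owner read/write only (mode 0600)."""
    os.chmod(path, 0o600)


def discard(path):
    """Delete a file; do nothing if it is already gone."""
    Path(path).unlink(missing_ok=True)


# Typical use after a run:
#   lock_down("report.xlsx")
#   discard("access.log.docviews")   # or lock_down(...) if kept for traceability
```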
Network behaviour
These scripts do not perform any network I/O.
They:
- read only from local disk
- write only to local disk
- do not send analytics, telemetry, or external traffic
This makes them suitable for isolated or security-sensitive environments.
Related Notes
This process is intended for audit and forensic reporting. It should not be used as a substitute for platform monitoring, user analytics, or ongoing behavioural reporting unless specifically approved.