switch from log to tracing

I think this is a big improvement in readability.

I removed the `lnav` config, which is a little sad, but I don't think it
supports this structured logging format well. Still seems worthwhile on
balance.
This commit is contained in:
Scott Lamb
2023-02-15 23:14:54 -08:00
parent db2e0f1d39
commit ebcdd76084
38 changed files with 632 additions and 344 deletions

View File

@@ -34,9 +34,13 @@ While Moonfire NVR is running, logs will be written to stderr.
Try `nvr logs` or `docker logs moonfire-nvr`.
* When running through systemd, stderr will be redirected to the journal.
Try `sudo journalctl --unit moonfire-nvr` to view the logs. You also
likely want to set `MOONFIRE_FORMAT=google-systemd` to format logs as
likely want to set `MOONFIRE_FORMAT=systemd` to format logs as
expected by systemd.
*Note:* Moonfire's log format has recently changed significantly. You may
encounter the older format in the issue tracker or (despite best efforts)
documentation that hasn't been updated.
Logging options are controlled by environment variables:
* `MOONFIRE_LOG` controls the log level. Its format is similar to the
@@ -45,80 +49,50 @@ Logging options are controlled by environment variables:
`MOONFIRE_LOG=info` is the default.
`MOONFIRE_LOG=info,moonfire_nvr=debug` gives more detailed logging of the
`moonfire_nvr` crate itself.
* `MOONFIRE_FORMAT` selects the output format. The two options currently
accepted are `google` (the default, like the Google
[glog](https://github.com/google/glog) package) and `google-systemd` (a
variation for better systemd compatibility).
* `MOONFIRE_COLOR` controls color coding when using the `google` format.
It accepts `always`, `never`, or `auto`. `auto` means to color code if
stderr is a terminal.
* `MOONFIRE_FORMAT` selects an output format. It defaults to an output meant
for human consumption. It can be overridden to either of the following:
* `systemd` uses [sd-daemon logging prefixes](https://man7.org/linux/man-pages/man3/sd-daemon.3.html))
* `json` outputs one JSON-formatted log message per line, for machine
consumption.
* Errors include a backtrace if `RUST_BACKTRACE=1` is set.
If you use Docker, set these via Docker's `--env` argument.
With the default `MOONFIRE_FORMAT=google`, log lines are in the following
format:
With `MOONFIRE_FORMAT` left unset, log events look as follows:
```text
I20210308 21:31:24.255 main moonfire_nvr] Success.
LYYYYmmdd HH:MM:SS.FFF TTTT PPPPPPPPPPPP] ...
L = level:
E = error; when color mode is on, the message will be bright red.
W = warn; " " " " " " " " " " yellow.
I = info
D = debug
T = trace
YYYY = year
mm = month
dd = day
HH = hour (using a 24-hour clock)
MM = minute
SS = second
FFF = fractional portion of the second
TTTT = thread name (if set) or tid (otherwise)
PPPP = log target (usually a module path)
... = message body
2023-02-15T22:45:06.999329 INFO s-courtyard-sub streamer{stream="courtyard-sub"}: moonfire_nvr::streamer: opening input url=rtsp://192.168.5.112/cam/realmonitor?channel=1&subtype=1&unicast=true&proto=Onvif
```
This example contains the following elements:
* the timestamp (`2023-02-15T22:45:06.9999329`) in the system's local zone.
* the log level (`INFO`) is one of `TRACE`, `DEBUG`, `INFO`, `WARN`, or
`ERROR`.
* the thread name (`s-courtyard-sub`), see explanation below.
* the "spans" (`streamer{stream="courtyard-sub"}`), which contain
context information for a group of messages. In this case there is a single
span `streamer` with a single field `stream`. There can be multiple
spans; they are listed starting from the root. Each may have fields.
* the target (`moonfire_nvr::streamer`), which generally corresponds to a Rust
module name.
* the log message (`opening input`), a human-readable string
* event fields (`url=...`)
Moonfire NVR names a few important thread types as follows:
* `main`: during `moonfire-nvr run`, the main thread does initial setup then
just waits for the other threads. In other subcommands, it does everything.
* `s-CAMERA-TYPE` (one per stream, where `TYPE` is `main`, `sub`, or `ext`):
these threads write video to disk.
* `sync-PATH` (one per sample file directory): These threads call `fsync` to
* `sync-DIR_ID` (one per sample file directory): These threads call `fsync` to
* commit sample files to disk, delete old sample files, and flush the
database.
* `r-PATH` (one per sample file directory): These threads read sample files
* `r-DIR_ID` (one per sample file directory): These threads read sample files
from disk for serving `.mp4` files.
* `tokio-runtime-worker` (one per core, unless overridden with
`--worker-threads`): these threads handle HTTP requests and read video
data from cameras via RTSP.
* `logger`: this thread writes the log buffer to `stderr`. Logging is
asynchronous; other threads don't wait for log messages to be written
unless the log buffer is full.
You can use the following command to teach [`lnav`](http://lnav.org/) Moonfire
NVR's log format:
```console
$ lnav -i misc/moonfire_log.json
```
`lnav` versions prior to 0.9.0 print a (harmless) warning message on startup:
```console
$ lnav -i git/moonfire-nvr/misc/moonfire_log.json
warning:git/moonfire-nvr/misc/moonfire_log.json:line 2
warning: unexpected path --
warning: /$schema
warning: accepted paths --
warning: /(?<format_name>\w+)/ -- The definition of a log file format.
info: installed: /home/slamb/.lnav/formats/installed/moonfire_log.json
```
You can avoid this by removing the `$schema` line from `moonfire_log.json`
and rerunning the `lnav -i` command.
Below are some interesting log lines you may encounter.
@@ -128,16 +102,16 @@ During normal operation, Moonfire NVR will periodically flush changes to its
SQLite3 database. Every flush is logged, as in the following info message:
```
I20210308 23:14:18.388 sync-/media/14tb/sample moonfire_db::db] Flush 3810 (why: 120 sec after start of 1 minute 14 seconds courtyard-main recording 3/1842086):
2021-03-08T23:14:18.388000 sync-2 syncer{path=/media/14tb/sample}:flush{flush_count=2 reason="120 sec after start of 1 minute 14 seconds courtyard-main recording 3/1842086"}: moonfire_db::db: flush complete:
/media/6tb/sample: added 98M 864K 842B in 8 recordings (4/1839795, 7/1503516, 6/1853939, 1/1838087, 2/1852096, 12/1516945, 8/1514942, 10/1506111), deleted 111M 435K 587B in 5 (4/1801170, 4/1801171, 6/1799708, 1/1801528, 2/1815572), GCed 9 recordings (6/1799707, 7/1376577, 4/1801168, 1/1801527, 4/1801167, 4/1801169, 10/1243252, 2/1815571, 12/1418785).
/media/14tb/sample: added 8M 364K 643B in 3 recordings (3/1842086, 9/1505359, 11/1516695), deleted 0B in 0 (), GCed 0 recordings ().
```
This log message is packed with debugging information:
* the date and time: `20210308 23:14:18.388`.
* the name of the thread that prompted the flush: `sync-/media/14tb/sample`.
* a sequence number: `3810`. This is handy for checking how often Moonfire NVR
* the date and time: `2021-03-08T23:14:18.388`.
* the name of the thread that prompted the flush: `sync-2`.
* a flush count: `3810`. This is handy for checking how often Moonfire NVR
is flushing.
* a reason for the flush: `120 sec after start of 1 minute 14 seconds courtyard-main recording 3/1842086`.
This was a regular periodic flush at the `flush_if_sec` for the stream,
@@ -178,9 +152,7 @@ file a bug if you see one. It's helpful to set the `RUST_BACKTRACE`
environment variable to include more information.
```
E20210304 11:09:29.230 main s-peck_west-main] panic at 'src/moonfire-nvr/server/db/writer.rs:750:54': should always be an unindexed sample
(set environment variable RUST_BACKTRACE=1 to see backtraces)"
2021-03-04T11:09:29.230291 ERROR s-peck_west-main streamer{stream="peck_west-main"}: panic: should always be an unindexed sample location=src/moonfire-nvr/server/db/writer.rs:750:54 backtrace=...
```
In this case, a stream thread (one starting with `s-`) panicked. That stream
@@ -195,11 +167,11 @@ It's normal to see these warnings on startup and occasionally while running.
Frequent occurrences may indicate a performance problem.
```
W20201129 12:01:21.128 s-driveway-main moonfire_base::clock] opening rtsp://admin:redacted@192.168.5.108/cam/realmonitor?channel=1&subtype=0&unicast=true&proto=Onvif took PT2.070715796S!
W20201129 12:32:15.870 s-west_side-sub moonfire_base::clock] getting next packet took PT10.158121387S!
W20201228 12:09:29.050 s-back_east-sub moonfire_base::clock] database lock acquisition took PT8.122452
W20201228 21:22:32.012 main moonfire_base::clock] database operation took PT39.526386958S!
W20201228 21:27:11.402 s-driveway-sub moonfire_base::clock] writing 37 bytes took PT20.701894190S!
2020-11-29T12:01:21.128725 WARN s-driveway-main streamer{stream="driveway-main"}: moonfire_base::clock: opening rtsp://admin:redacted@192.168.5.108/cam/realmonitor?channel=1&subtype=0&unicast=true&proto=Onvif took PT2.070715796S!
2020-11-29T12:32:15.870658 WARN s-west_side-sub streamer{stream="west_side-sub"}: moonfire_base::clock: getting next packet took PT10.158121387S!
2020-12-28T12:09:29.050464 WARN s-back_east-sub streamer{stream="s-back_east-sub"}: moonfire_base::clock: database lock acquisition took PT8.122452
2020-12-28T21:22:32.012811 WARN main moonfire_base::clock: database operation took PT39.526386958S!
2020-12-28T21:27:11.402259 WARN s-driveway-sub streamer{stream="s-driveway-sub"}: moonfire_base::clock: writing 37 bytes took PT20.701894190S!
```
### Camera stream errors
@@ -211,7 +183,7 @@ quickly enough. In the latter case, you'll likely see a
`getting next packet took PT...S!` message as described above.
```
W20210309 00:28:55.527 s-courtyard-sub moonfire_nvr::streamer] courtyard-sub: sleeping for PT1S after error: Stream ended
2021-03-09T00:28:55.527078 WARN s-courtyard-sub streamer{stream="courtyard-sub"}: moonfire_nvr::streamer: sleeping for PT1S after error: Stream ended
(set environment variable RUST_BACKTRACE=1 to see backtraces)
```
@@ -251,7 +223,7 @@ If Moonfire NVR runs out of disk space on a sample file directory, recording
will be stuck and you'll see log messages like the following:
```
W20210401 11:21:07.365 s-driveway-main moonfire_base::clock] sleeping for PT1S after error: No space left on device (os error 28)
2021-04-01T11:21:07.365 WARN s-driveway-main streamer{stream="s-driveway-main"}: moonfire_base::clock: sleeping for PT1S after error: No space left on device (os error 28)
```
If something else used more disk space on the filesystem than planned, just