moonfire-nvr/guide/troubleshooting.md

# Troubleshooting

Here are some tips for diagnosing various problems with Moonfire NVR. Feel free
to open an [issue](https://github.com/scottlamb/moonfire-nvr/issues) if you
need more help.

* [Troubleshooting](#troubleshooting)
    * [Viewing Moonfire NVR's logs](#viewing-moonfire-nvrs-logs)
        * [Flushes](#flushes)
        * [`thread '...' panicked` errors](#thread--panicked-errors)
        * [Slow operations](#slow-operations)
        * [Camera stream errors](#camera-stream-errors)
    * [Problems](#problems)
        * [`Error: pts not monotonically increasing; got 26615520 then 26539470`](#error-pts-not-monotonically-increasing-got-26615520-then-26539470)
        * [`moonfire-nvr config` displays garbage](#moonfire-nvr-config-displays-garbage)
        * [Moonfire NVR reports problems with the database or filesystem](#moonfire-nvr-reports-problems-with-the-database-or-filesystem)
        * [<a name="kernel-errors"></a> Errors in kernel logs](#-errors-in-kernel-logs)
            * [UAS errors](#uas-errors)
            * [Filesystem errors](#filesystem-errors)

## Viewing Moonfire NVR's logs

While Moonfire NVR is running, logs will be written to stderr.

*   When running the configuration UI, you typically should redirect stderr
    to a text file to avoid poor interaction between the interactive stdout
    output and the logging. If you use the recommended
    `nvr config 2>debug-log` command, output will be in the `debug-log` file.
*   When running detached through Docker, Docker saves the logs for you.
    Try `nvr logs` or `docker logs moonfire-nvr`.
*   When running through systemd, stderr will be redirected to the journal.
    Try `sudo journalctl --unit moonfire-nvr` to view the logs. You also
    likely want to set `MOONFIRE_FORMAT=google-systemd` to format logs as
    expected by systemd.

Logging options are controlled by environment variables:

*   `MOONFIRE_LOG` controls the log level. Its format is similar to the
    `RUST_LOG` variable used by the
    [env-logger](http://rust-lang-nursery.github.io/log/env_logger/) crate.
    `MOONFIRE_LOG=info` is the default.
    `MOONFIRE_LOG=info,moonfire_nvr=debug` gives more detailed logging of the
    `moonfire_nvr` crate itself.
*   `MOONFIRE_FORMAT` selects the output format. The two options currently
    accepted are `google` (the default, like the Google
    [glog](https://github.com/google/glog) package) and `google-systemd` (a
    variation for better systemd compatibility).
*   Errors include a backtrace if `RUST_BACKTRACE=1` is set.

If you use Docker, set these via Docker's `--env` argument.

With the default `MOONFIRE_FORMAT=glog`, log lines are in the following
format:

```text
I20210308 21:31:24.255 main moonfire_nvr] Success.
LYYYYmmdd HH:MM:SS.FFF TTTT PPPPPPPPPPPP] ...
L    = level:
       E = error
       W = warn
       I = info
       D = debug
       T = trace
YYYY = year
mm   = month
dd   = day
HH   = hour (using a 24-hour clock)
MM   = minute
SS   = second
FFF  = fractional portion of the second
TTTT = thread name (if set) or tid (otherwise)
PPPP = module path
...  = message body
```

You can use the following command to teach [`lnav`](http://lnav.org/) Moonfire
NVR's log format:

```
$ lnav -i misc/moonfire_log.json
```

`lnav` versions prior to 0.9.0 print a (harmless) warning message on startup:

```
$ lnav -i git/moonfire-nvr/misc/moonfire_log.json
warning:git/moonfire-nvr/misc/moonfire_log.json:line 2
warning:  unexpected path --
warning:    /$schema
warning:  accepted paths --
warning:    /(?<format_name>\w+)/  -- The definition of a log file format.
info: installed: /home/slamb/.lnav/formats/installed/moonfire_log.json
```

You can avoid this by removing the `$schema` line from `moonfire_log.json`
and rerunning the `lnav -i` command.

Below are some interesting log lines you may encounter.

### Flushes

During normal operation, Moonfire NVR will periodically flush changes to its
SQLite3 database. Every flush is logged, as in the following info message:

```
I20210308 23:14:18.388 sync-/media/14tb/sample moonfire_db::db] Flush 3810 (why: 120 sec after start of 1 minute 14 seconds courtyard-main recording 3/1842086):
/media/6tb/sample: added 98M 864K 842B in 8 recordings (4/1839795, 7/1503516, 6/1853939, 1/1838087, 2/1852096, 12/1516945, 8/1514942, 10/1506111), deleted 111M 435K 587B in 5 (4/1801170, 4/1801171, 6/1799708, 1/1801528, 2/1815572), GCed 9 recordings (6/1799707, 7/1376577, 4/1801168, 1/1801527, 4/1801167, 4/1801169, 10/1243252, 2/1815571, 12/1418785).
/media/14tb/sample: added 8M 364K 643B in 3 recordings (3/1842086, 9/1505359, 11/1516695), deleted 0B in 0 (), GCed 0 recordings ().
```

This log message is packed with debugging information:

*   the date and time: `20210308 23:14:18.388`.
*   the name of the thread that prompted the flush: `sync-/media/14tb/sample`.
*   a sequence number: `3810`. This is handy for checking how often Moonfire NVR
    is flushing.
*   a reason for the flush: `120 sec after start of 1 minute 14 seconds courtyard-main recording 3/1842086`.
    This was a regular periodic flush at the `flush_if_sec` for the stream,
    as described in [install.md](install.md). `3/1842086` is an identifier for
    the recording, in the form `stream_id/recording_id`. It corresponds to the
    file `/media/14tb/sample/00000003001c1ba6`. On-disk files are named by
    a fixed eight hexadecimal digits for the stream id and eight hexadecimal
    digits for the recording id. You can convert with `printf`:
    ```
    $ printf '%08x%08x\n' 3 1842086
    00000003001c1ba6
    ```
*   For each affected sample file directory (`/media/6tb/sample` and
    `/media/14tb/sample`), a line showing the exact changes included in the
    flush. There are three kinds of changes:

    *   added recordings–these files are already fully written in the sample
        file directory and now are being added to the database.
    *   deleted recordings–these are being removed from the database's
        `recording` table (and added to the `garbage` table) in preparation
        for being deleted from the sample file directory. They can no longer
        be accessed after this flush.
    *   GCed (garbage-collected) recordings—these have been fully removed from
        disk and no longer will be referenced in the database at all.

    You can learn more about these in the "Lifecycle of a recording" section
    of the [recording schema design document](../design/schema.md).

    For added and deleted recordings, the line includes sizes in bytes
    (`98M 864K 842B` represents 10,3646,026 bytes, or about 99 MiB), numbers
    of recordings, and the IDs of each recording. For GCed recordings, the
    sizes are omitted (as this information is not stored).

### `thread '...' panicked` errors

Errors like the one below indicate a serious bug in Moonfire NVR. Please
file a bug if you see one. It's helpful to set the `RUST_BACKTRACE`
environment variable to include more information.

```
thread 's-peck_west-main' panicked at 'should always be an unindexed sample', /usr/local/src/moonfire-nvr/server/db/writer.rs:750:54
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

### Slow operations

Warnings like the following indicate that some operation took more than 1
second to perform. `PT2.070715796S` means about 2 seconds.

It's normal to see these warnings on startup and occasionally while running.
Frequent occurrences may indicate a performance problem.

```
W20201129 12:01:21.128 s-driveway-main moonfire_base::clock] opening rtsp://admin:redacted@192.168.5.108/cam/realmonitor?channel=1&subtype=0&unicast=true&proto=Onvif took PT2.070715796S!
W20201129 12:32:15.870 s-west_side-sub moonfire_base::clock] getting next packet took PT10.158121387S!
W20201228 12:09:29.050 s-back_east-sub moonfire_base::clock] database lock acquisition took PT8.122452
W20201228 21:22:32.012 main moonfire_base::clock] database operation took PT39.526386958S!
W20201228 21:27:11.402 s-driveway-sub moonfire_base::clock] writing 37 bytes took PT20.701894190S!
```

### Camera stream errors

Warnings like the following indicate that a camera stream was lost due to some
error and Moonfire NVR will try reconnecting shortly. In this case,
`End of file` means that the camera ended the stream. This might happen when the
camera is rebooting or if Moonfire is not consuming packets quickly enough.
In the latter case, you'll likely see a `getting next packet took PT...S!`
message as described above.

```
W20210309 00:28:55.527 s-courtyard-sub moonfire_nvr::streamer] courtyard-sub: sleeping for Duration { secs: 1, nanos: 0 } after error: End of file
(set environment variable RUST_BACKTRACE=1 to see backtraces)
```

## Problems

### `Error: pts not monotonically increasing; got 26615520 then 26539470`

If your streams cut out and you see error messages like this one in Moonfire
NVR logs, it might mean that your camera outputs [B
frames](https://en.wikipedia.org/wiki/Video_compression_picture_types#Bi-directional_predicted_.28B.29_frames.2Fslices_.28macroblocks.29).
If you believe this is the case, file a feature request; Moonfire NVR
currently doesn't support B frames. You may be able to configure your camera
to disable B frames in the meantime.

### `moonfire-nvr config` displays garbage

This happens if your machine is configured to a non-UTF-8 locale, due to
gyscos/Cursive#13. As a workaround, try setting the environment variable
`LC_ALL=C.UTF-8`. This should automatically be set with the Docker container.

### Moonfire NVR reports problems with the database or filesystem

It's helpful to check out your system's overall health when diagnosing
problems with Moonfire NVR.

1.  Look at your kernel logs. On most Linux systems, you can browse them via
    `journalctl`, `dmesg`, or `less /var/log/messages`. See [Errors in kernel
    logs](#error) below for some common problems.
2.  Use [`smartctl`](https://linuxconfig.org/how-to-check-an-hard-drive-health-from-the-command-line-using-smartctl) to
    look at SMART ("Self-Monitoring, Analysis and Reporting Technology System
    (SMART)") attributes on your flash and hard drives. Backblaze
    [reports](https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/)
    that the following SMART attributes are most predictive of drive failure:
    *   SMART 5: Reallocated Sectors Count
    *   SMART 187: Reported Uncorrectable Errors
    *   SMART 188: Command Timeout
    *   SMART 197: Current Pending Sector Count
    *   SMART 198: Uncorrectable Sector Count
    If the RAW value for any of these attributes is non-zero, it's likely
    your problem is due to hardware.
3.  Use `smartctl` to run a self-test on your flash and hard drives.
4.  Run `fsck` on your filesystems.

    Your root filesystem is best checked on startup, before it's mounted as
    read-write. On most Linux systems, you can force `fsck` to run on next
    startup via the `fsck.mode=force` kernel parameter, as documented
    [here](https://www.freedesktop.org/software/systemd/man/systemd-fsck@.service.html).

    If you have hard drives dedicated to Moonfire NVR, you can also shut down
    Moonfire NVR, unmount the filesystem, and run `fsck` on them without
    rebooting.

After the system as a whole is verified healthy, run `moonfire-nvr check` while
Moonfire NVR is stopped to verify integrity of the SQLite database and sample
file directories.

### <a name="kernel-errors"></a> Errors in kernel logs

#### UAS errors

Some cheap USB SATA adapters don't appear to work reliably in UAS mode under
Linux. If you see errors like the following, try [disabling
UAS](https://unix.stackexchange.com/questions/239782/connection-problem-with-usb3-external-storage-on-linux-uas-driver-problem).
Unfortunately your filesystem is likely to have corruption, so after disabling UAS,
run a `fsck` and then `moonfire-nvr check` to try recovering.

```
Sep 22 17:26:01 nuc kernel: sd 4:0:0:1: [sdb] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
Sep 22 17:26:01 nuc kernel: sd 4:0:0:1: [sdb] tag#2 CDB: Write(16) 8a 00 00 00 00 01 4d b4 c4 00 00 00 03 b0 00 00
```

#### Filesystem errors

Errors that mention `EXT4-fs` (or your filesystem of choice) likely indicate
filesystem corruption. Run `fsck` to fix as described above. Once the
corruption is addressed, use `moonfire-nvr check` to survey the damage to
your database.

```
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): error count since last fsck: 12
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): initial error at time 1576998292: ext4_validate_block_bitmap:376
Jan 28 07:26:27 nuc kernel: EXT4-fs (sdc1): last error at time 1579640202: ext4_validate_block_bitmap:376
...
Feb 13 04:48:43 nuc kernel: EXT4-fs error (device sdc1): ext4_validate_block_bitmap:376: comm kworker/u8:2: bg 57266: bad block bitmap checksum
Feb 13 04:48:43 nuc kernel: EXT4-fs (sdc1): Delayed block allocation failed for inode 7334278 at logical offset 0 with max blocks 11 with error 74
Feb 13 04:48:43 nuc kernel: EXT4-fs (sdc1): This should not happen!! Data will be lost
```