This is useful for a combo scrub bar-based UI (#32) + live view UI (#59)
in a non-obvious way. When constructing a HTML Media Source Extensions
API SourceBuffer, the caller can specify a "mode" of either "segments"
or "sequence":
In "sequence" mode, playback assumes segments are added sequentially.
This is good enough for a live view-only UI (#59) but not for a scrub
bar UI in which you may want to seek backward to a segment you've never
seen before. You will then need to insert a segment out-of-sequence.
Imagine what happens when the user goes forward again until the end of
the segment inserted immediately before it. The user should see the
chronologically next segment or a pause for loading if it's unavailable.
The best approximation of this is to track the mapping of timestamps to
segments and insert a VTTCue with an enter/exit handler that seeks to
the right position. But seeking isn't instantaneous; the user will
likely briefly see first the segment they seeked to before. That's
janky. Additionally, the "canplaythrough" event will behave strangely.
In "segments" mode, playback respects the timestamps we set:
* The obvious choice is to use wall clock timestamps. This is fine if
they're known to be fixed and correct. They're not. The
currently-recording segment may be "unanchored", meaning its start
timestamp is not yet fixed. Older timestamps may overlap if the system
clock was stepped between runs. The latter isn't /too/ bad from a user
perspective, though it's confusing as a developer. We probably will
only end up showing the more recent recording for a given
timestamp anyway. But the former is quite annoying. It means we have
to throw away part of the SourceBuffer that we may want to seek back
(causing UI pauses when that happens) or keep our own spare copy of it
(memory bloat). I'd like to avoid the whole mess.
* Another approach is to use timestamps that are guaranteed to be in
the correct order but that may have gaps. In particular, a timestamp
of (recording_id * max_recording_duration) + time_within_recording.
But again seeking isn't instantaneous. In my experiments, there's a
visible pause between segments that drives me nuts.
* Finally, the approach that led me to this schema change. Use
timestamps that place each segment after the one before, possibly with
an intentional gap between runs (to force a wait where we have an
actual gap). This should make the browser's natural playback behavior
work properly: it never goes to an incorrect place, and it only waits
when/if we want it to. We have to maintain a mapping between its
timestamps and segment ids but that's doable.
This commit is only the schema change; the new data aren't exposed in
the API yet, much less used by a UI.
Note that stream.next_recording_id became stream.cum_recordings. I made
a slight definition change in the process: recording ids for new streams
start at 0 rather than 1. Various tests changed accordingly.
The upgrade process makes a best effort to backfill these new fields,
but of course it doesn't know the total duration or number of runs of
previously deleted rows. That's good enough.
When compiled with cargo build --features=analytics and enabled via
moonfire-nvr run --object-detection, this runs object detection on every
sub stream frame through an Edge TPU (a Coral USB accelerator) and logs
the result.
This is a very small step toward a working system. It doesn't actually
record the result in the database or send it out on the live stream yet.
It doesn't support running object detection at a lower frame rate than
the sub streams come in at either. To address those problems, I need to
do some refactoring. Currently moonfire_db::writer::Writer::Write is the
only place that knows the duration of the frame it's about to flush,
before it gets added to the index or sent out on the live stream. I
don't want to do the detection from there; I'd prefer the moonfire_nvr
crate. So I either need to introduce an analytics callback or move a
bunch of that logic to the other crate.
Once I do that, I need to add database support (although I have some
experiments for that in moonfire-playground) and API support, then some
kind of useful frontend.
Note edgetpu.tflite is taken from the Apache 2.0-licensed
https://github.com/google-coral/edgetpu,
test_data/mobilenet_ssd_v2_coco_quant_postprocess_edgetpu.tflite. The
following page says it's fine to include Apache 2.0 stuff in GPLv3
projects:
https://www.apache.org/licenses/GPL-compatibility.html
* As discussed in #48, say "The Moonfire NVR Authors" at the top of
every file rather than whoever created that file. Have one AUTHORS
file listing everyone.
* Consistently call it a "security camera network video recorder" rather
than "security camera digital video recorder".
My dad's "GW-GW4089IP" cameras use separate ports for the main and sub
streams:
rtsp://192.168.1.110:5050/H264?channel=0&subtype=0&unicast=true&proto=Onvif
rtsp://192.168.1.110:5049/H264?channel=0&subtype=1&unicast=true&proto=Onvif
Previously I could get one of the streams to work by including :5050 or
:5049 in the host field of the camera. But not both. Now make the
camera's host field reflect the ONVIF port (which is also non-standard
on these cameras, :85). It's not directly used yet but probably will be
sooner or later. Make each stream know its full URL.
My installation recently somehow ended up with a recording with a
duration of 503793844 90,000ths of a second, way over the maximum of 5
minutes. (Looks like the machine was pretty unresponsive at the time
and/or having network problems.)
When this happens, the system really spirals. Every flush afterward (12
per minute with my installation) fails with a CHECK constraint failure
on the recording table. It never gives up on that recording. /var/log
fills pretty quickly as this failure is extremely verbose (a stack
trace, and a line for each byte of video_index). Eventually the sample
file dirs fill up too as it continues writing video samples while GC is
stuck. The video samples are useless anyway; given that they're not
referenced in the database, they'll be deleted on next startup.
This ensures the offending recording is never added to the database, so
we don't get the same persistent problem. Instead, writing to the
recording will fail. The stream will drop and be retried. If the
underlying condition that caused a too-long recording (many
non-key-frames, or the camera returning a crazy duration, or the
monotonic clock jumping forward extremely, or something) has gone away,
the system should recover.
This is mostly just "cargo fix --edition" + Cargo.toml changes.
There's one fix for upgrading to NLL in db/writer.rs:
Writer::previously_opened wouldn't build with NLL because of a
double-borrow the previous borrow checker somehow didn't catch.
Restructure to avoid it.
I'll put elective NLL changes in a following commit.
This is a minor code size reduction - instead of being monomorphized
into four variants (according to "cargo llvm-lines"), it's now
monomorphized into two. The stripped release binary on macOS is about
8kB smaller (0.15%). Not a huge improvement but better than nothing.
Benchmarks seem unchanged (though they have a lot of variance).
I moved the clocks member from LockedDatabase to Database to make this happen,
so the new DatabaseGuard (replacing a direct MutexGuard<LockedDatabase>) can
access it before acquiring the lock.
I also made the type of clock a type parameter of Database (and so several
other things throughout the system). This allowed me to drop the Arc<>, but
more importantly it means that the Clocks trait doesn't need to stay
object-safe. I plan to take advantage of that shortly.
* separate these out into a new file, writer.rs, as dir.rs was getting
unwieldy.
* extract traits for the parts of SampleFileDir and std::fs::File they needed;
set up mock implementations.
* move clock.rs to a new base crate to be accessible from the db crate.
* add tests that exercise all the retry paths.
* bugfix: account for the new recording's bytes when calculating how much to
delete.
* bugfix: when retrying an unlink failure in collect_garbage, we shouldn't
warn about all the recordings no longer existing. Do this by retrying each
step rather than the whole procedure again.
* avoid double-panic scenarios, which I hit while tweaking the mocks. These
are quite annoying to debug as Rust doesn't print information about either
panic. I ended up using lldb to get a backtrace. Better to be cautious about
what we're doing when already panicking.
* give more context on raw::insert_recording errors, which I hit as well while
tweaking the new tests.
Every recording it starts must be sent to the syncer with at least one sample
written. It will try forever (unless the channel is down, then panic). This
avoids the situation in which it prevents something in the uncommitted
VecDeque from ever being synced and thus any further recordings from being
flushed.
The new approach is to, rather than panicking, retry forever. The assumption
is that if a given operation is failing, a following operation is unlikely to
succeed, so it's simpler to just keep trying the earlier one than come up with
ways to undo it and proceed with later operations.
I still need to apply this approach to the Writer class. It currently unwraps
(crashes) or just gives up on a recording without ever sending it to the
Syncer. Given that recordings are all synced in order, that means further ones
can never be synced.
I mistakenly thought these had to be monomorphized. (The FnOnce still
does, until rust-lang/rfcs#1909 is implemented.) Turns out this way works
fine. It should result in less compile time / code size, though I didn't check
this.
This improves the practicality of having many streams (including the doubling
of streams by having main + sub streams for each camera). With these tuned
properly, extra streams don't cause any extra write cycles in normal or error
cases. Consider the worst case in which each RTSP session immediately sends a
single frame and then fails. Moonfire retries every second, so this would
formerly cause one commit per second per stream. (flush_if_sec=0 preserves
this behavior.) Now the commits can be arbitrarily infrequent by setting
higher values of flush_if_sec.
WARNING: this isn't production-ready! I hacked up dir.rs to make tests pass
and "moonfire-nvr run" work in the best-case scenario, but it doesn't handle
errors gracefully. I've been debating what to do when writing a recording
fails. I considered "abandoning" the recording then either reusing or skipping
its id. (in the latter case, marking the file as garbage if it can't be
unlinked immediately). I think now there's no point in abandoning a recording.
If I can't write to that file, there's no reason to believe another will work
better. It's better to retry that recording forever, and perhaps put the whole
directory into an error state that stops recording until those writes go
through. I'm planning to redesign dir.rs to make this happen.
It should reduce compile time / memory usage to put quite a bit of the code
into a separate crate. I also intend to limit visibility of some things to
only within the db crate, but that's for a future change. This is the smallest
move that will compile.
The filenames now represent composite ids (stream id + recording id) rather
than a separate uuid system with its own reservation for a few benefits:
* This provides more information when there are inconsistencies.
* This avoids the need for managing the reservations during recording. I
expect this to simplify delaying flushing of newly written sample files.
Now the directory has to be scanned at startup for files that never got
written to the database, but that's acceptably fast even with millions of
files.
* Less information to keep in memory and in the recording_playback table.
I'd considered using one directory per stream, which might help if the
filesystem has trouble coping with huge directories. But that would mean each
dir has to be fsync()ed separately (more latency and/or more multithreading).
So I'll stick with this until I see concrete evidence of a problem that would
solve.
Test coverage of the error conditions is poor. I plan to do some restructuring
of the db/dir code, hopefully making steps toward testability along the way.
This is still pretty basic support. There's no config UI support for
renaming/moving the sample file directories after they are created, and no
error checking that the files are still in the expected place. I can imagine
sysadmins getting into trouble trying to change things. I hope to address at
least some of that in a follow-up change to introduce a versioning/locking
scheme that ensures databases and sample file dirs match in some way.
A bonus change that kinda got pulled along for the ride: a dialog pops up in
the config UI while a stream is being tested. The experience was pretty bad
before; there was no indication the button worked at all until it was done,
sometimes many seconds later.
This allows each camera to have a main and a sub stream. Previously there was
a field in the schema for the sub stream's url, but it didn't do anything. Now
you can configure individual retention for main and sub streams. They show up
grouped in the UI.
No support for upgrading from schema version 1 yet.
My odroid setup has been occasionally (about once a week) losing about 15
seconds of recordings on all cameras. I'm not sure why. So I'm labelling all
the likely suspect spots and logging if any of them takes longer than a
second. I think this will give me more information; hopefully narrow it down
to network or local disk I/O.
This significantly improves safety of the ffmpeg interface. The complex
ABIs aren't accessed directly from Rust. Instead, I have a small C
wrapper which uses the ffmpeg C API and the C headers at compile-time to
determine the proper ABI in the same way any C program using ffmpeg
would, so that the ABI doesn't have to be duplicated in Rust code.
I've tested with ffmpeg 2.x and ffmpeg 3.x; it seems to work properly
with both where before ffmpeg 3.x caused segfaults.
It still depends on ABI compatibility between the compiled and running
versions. C programs need this, too, and normal shared library
versioning practices provide this guarantee. But log both versions on
startup for diagnosing problems with this.
Fixes#7
It had an Arc which in hindsight isn't necessary; the actual video index
generation is fast anyway. This saves a couple pointers per cache entry and
the overhead of chasing them. LruCache itself also has some extra pointers on
it but that's something to address another day.
This reduces the working set by another 960 bytes for a typical one-hour recording, improving cache efficiency a bit more.
8 bytes from SampleIndexIterator:
* reduce the three "bytes" fields to two. Doing so as "bytes_key" vs
"bytes_nonkey" slowed it down a bit, perhaps because the "bytes" is
needed right away and requires a branch. But "bytes" vs "bytes_other"
seems fine. Looks like it can do this with cmovs in parallel with other
stuff.
* stuff "is_key" into the "i" field.
8 bytes from recording::Segment itself:
* make "frames" and "key_frame" u16s
* stuff "trailing_zero" into "video_sample_entry_id"
As described in design/time.md:
* get the realtime-monotonic once at the start of a run and use the
monotonic clock afterward to avoid problems with local time steps
* on every recording, try to correct the latest local_time_delta at up
to 500 ppm
Let's see how this works...
The test ensures it solves the problem of the initial buffering throwing off
the start time of the first segment.
Along the way, I tested and fixed the new TrailingZero flag; it wasn't being
set.
This is as described in design/time.md. Other aspects of that design
(including using the monotonic clock and adjusting the durations to compensate
for camera clock frequency error) are not implemented yet. No new tests yet.
Just trying to get some flight miles on these ideas as soon as I can.
The advantages of the new schema are:
* overlapping recordings can be unambiguously described and viewed.
This is a significant problem right now; the clock on my cameras appears to
run faster than the (NTP-synchronized) clock on my NVR. Thus, if an
RTSP session drops and is quickly reconnected, there's likely to be
overlap.
* less I/O is required to view mp4s when there are multiple cameras.
This is a pretty dramatic difference in the number of database read
syscalls with pragma page_size = 1024 (605 -> 39 in one test),
although I'm not sure how much of that maps to actual I/O wait time.
That's probably as dramatic as it is due to overflow page chaining.
But even with larger page sizes, there's an improvement. It helps to
stop interleaving the video_index fields from different cameras.
There are changes to the JSON API to take advantage of this, described
in design/api.md.
There's an upgrade procedure, described in guide/schema.md.
This test is copied from the C++ implementation. It ensures the timestamps are
calculated accurately from the pts rather than using ffmpeg's estimated
duration. The Rust implementation was doing the easy-but-inaccurate thing, so
fix that to make the test pass.
Additionally, I did this with a code structure that should ensure the Rust
code never drops a Writer without indicating to the syncer that its uuid is
abandoned. Such a bug essentially leaks the partially-written file, although a
restart would cause it to be properly unlinked and marked as such. There are
no tests (yet) that exercise this scenario, though.
I should have submitted/pushed more incrementally but just played with it on
my computer as I was learning the language. The new Rust version more or less
matches the functionality of the current C++ version, although there are many
caveats listed below.
Upgrade notes: when moving from the C++ version, I recommend dropping and
recreating the "recording_cover" index in SQLite3 to pick up the addition of
the "video_sync_samples" column:
$ sudo systemctl stop moonfire-nvr
$ sudo -u moonfire-nvr sqlite3 /var/lib/moonfire-nvr/db/db
sqlite> drop index recording_cover;
sqlite3> create index ...rest of command as in schema.sql...;
sqlite3> ^D
Some known visible differences from the C++ version:
* .mp4 generation queries SQLite3 differently. Before it would just get all
video indexes in a single query. Now it leads with a query that should be
satisfiable by the covering index (assuming the index has been recreated as
noted above), then queries individual recording's indexes as needed to fill
a LRU cache. I believe this is roughly similar speed for the initial hit
(which generates the moov part of the file) and significantly faster when
seeking. I would have done it a while ago with the C++ version but didn't
want to track down a lru cache library. It was easier to find with Rust.
* On startup, the Rust version cleans up old reserved files. This is as in the
design; the C++ version was just missing this code.
* The .html recording list output is a little different. It's in ascending
order, with the most current segment shorten than an hour rather than the
oldest. This is less ergonomic, but it was easy. I could fix it or just wait
to obsolete it with some fancier JavaScript UI.
* commandline argument parsing and logging have changed formats due to
different underlying libraries.
* The JSON output isn't quite right (matching the spec / C++ implementation)
yet.
Additional caveats:
* I haven't done any proof-reading of prep.sh + install instructions.
* There's a lot of code quality work to do: adding (back) comments and test
coverage, developing a good Rust style.
* The ffmpeg foreign function interface is particularly sketchy. I'd
eventually like to switch to something based on autogenerated bindings.
I'd also like to use pure Rust code where practical, but once I do on-NVR
motion detection I'll need to existing C/C++ libraries for speed (H.264
decoding + OpenCL-based analysis).