Some caveats:
* it doesn't record the peer IP yet, which makes it harder to verify
sessions are valid. This is a little annoying to do in hyper now
(see hyperium/hyper#1410). The direct peer might not be what we want
right now anyway because there's no TLS support yet (see #27). In
the meantime, the sane way to expose Moonfire NVR to the Internet is
via a proxy server, and recording the proxy's IP is not useful.
  Maybe better to interpret an RFC 7239 Forwarded header (and/or
the older X-Forwarded-{For,Proto} headers).
* it doesn't ever use Secure (https-only) cookies, for a similar reason.
  It's not safe to use even with a TLS proxy until this is fixed.
* there's no "moonfire-nvr config" support for inspecting/invalidating
sessions yet.
* in debug builds, logging in is crazy slow. See libpasta/libpasta#9.
Some notes:
* I removed the Javascript "no-use-before-define" lint, as some of
the functions form a cycle.
* Fixed #20 along the way. I needed to add support for properly
returning non-OK HTTP statuses to signal unauthorized and such.
* I removed the Access-Control-Allow-Origin header support, which was
at odds with the "SameSite=lax" in the cookie header. The "yarn
start" method for running a local proxy server accomplishes the same
thing as the Access-Control-Allow-Origin support in a more secure
manner.
Fixes #60
The reqwest dependency upgrade is significant because the old version required
an old version of openssl, complicating compilation on newer platforms.
reqwest also pulled in old/duplicate versions of hyper, tokio, etc.
Nice to drop a lot of that cruft.
I left rusqlite and uuid alone because they had breaking changes I
didn't want to mess with at the moment.
Bumped the minimum Rust version to 1.30.0, as required by the
new encoding_rs crate (and perhaps other things).
This is a minor code size reduction - instead of being monomorphized
into four variants (according to "cargo llvm-lines"), it's now
monomorphized into two. The stripped release binary on macOS is about
8kB smaller (0.15%). Not a huge improvement but better than nothing.
Benchmarks seem unchanged (though they have a lot of variance).
I want to use hyper::server::Request::bytes_mut(), so an update is
needed. Update everything at once. Most notably, the http-serve update
starts using the http crate types for some things. (More to come.)
The new behavior eliminates a couple unpleasant edge cases in which it
would never flush:
* if all recording stops, whatever was unflushed would stay that way
* if every recording attempt produces a 0-duration recording (such as if the
camera sends only one frame and thus no PTS delta can be calculated),
the list of recordings to flush would continue to grow
I moved the clocks member from LockedDatabase to Database to make this happen,
so the new DatabaseGuard (replacing a direct MutexGuard<LockedDatabase>) can
access it before acquiring the lock.
I also made the type of clock a type parameter of Database (and so several
other things throughout the system). This allowed me to drop the Arc<>, but
more importantly it means that the Clocks trait doesn't need to stay
object-safe. I plan to take advantage of that shortly.
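A sketch of the resulting shape (bodies and extra fields are illustrative
only):

    use std::sync::{Mutex, MutexGuard};

    trait Clocks {
        fn monotonic(&self) -> std::time::Instant;
    }

    struct LockedDatabase; // stand-in for the real mutex-guarded state

    // clocks now lives outside the mutex, so it can be consulted
    // before (or while) acquiring the lock.
    struct Database<C: Clocks> {
        clocks: C,
        db: Mutex<LockedDatabase>,
    }

    struct DatabaseGuard<'a, C: Clocks + 'a> {
        clocks: &'a C,
        db: MutexGuard<'a, LockedDatabase>,
    }

    impl<C: Clocks> Database<C> {
        fn lock(&self) -> DatabaseGuard<C> {
            // e.g., note the time via self.clocks before blocking.
            DatabaseGuard { clocks: &self.clocks, db: self.db.lock().unwrap() }
        }
    }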
* separate these out into a new file, writer.rs, as dir.rs was getting
unwieldy.
* extract traits for the parts of SampleFileDir and std::fs::File they needed;
set up mock implementations.
* move clock.rs to a new base crate to be accessible from the db crate.
* add tests that exercise all the retry paths.
* bugfix: account for the new recording's bytes when calculating how much to
delete.
* bugfix: when retrying an unlink failure in collect_garbage, we shouldn't
warn about all the recordings no longer existing. Do this by retrying each
step rather than the whole procedure again.
* avoid double-panic scenarios, which I hit while tweaking the mocks. These
  are quite annoying to debug, as Rust doesn't print information about either
  panic; I ended up using lldb to get a backtrace. Better to be cautious about
  what we're doing when already panicking (see the sketch after this list).
* give more context on raw::insert_recording errors, which I hit as well while
tweaking the new tests.
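For the double-panic point above, the relevant caution is checking
std::thread::panicking() before doing anything in a Drop impl that could
itself panic; a minimal sketch (MockDir is hypothetical):

    // Sketch: assert invariants in drop only when not already unwinding,
    // so one test failure doesn't become an undebuggable double panic.
    struct MockDir {
        open_files: usize,
    }

    impl Drop for MockDir {
        fn drop(&mut self) {
            if std::thread::panicking() {
                return; // already unwinding; don't risk a second panic
            }
            assert_eq!(self.open_files, 0, "leaked open file");
        }
    }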
There may be considerable lag between being fully written and being committed
when using the flush_if_sec feature. Additionally, this is a step toward
listing and viewing recordings before they're fully written. That's a
considerable delay: 60 to 120 seconds for the first recording of a run,
0 to 60 seconds for subsequent recordings.
These recordings aren't yet included in the information returned by
/api/?days=true. They probably should be, but small steps.
I want to start having the db.rs version augment this with the uncommitted
recordings, and it's nice to have the separation of the raw db vs augmented
versions. Also, this fits with the general theme of shrinking db.rs a bit.
I had to put the raw video_sample_entry_id into the rows rather than
the video_sample_entry Arc. In hindsight, this is better anyway: the common
callers don't need to do the btree lookup and arc clone on every row. I think
I'd originally done it that way only because I was quite new to rust and
didn't understand that db could be used from within the row callback given
that both borrows are immutable.
In hindsight, the "post_tx" step in the upgrade process introduced in
e7f5733 doesn't make sense. If the procedure fails at this stage, nothing says
it still needs to be completed. If the sample file dirs have to be updated
after the database, then there should be another database version to mark that
it's fully completed, and indeed that's the purpose version 3 serves. So get
rid of the Upgrader trait and just go back to a simple run function per
version.
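Concretely, the upgrade flow reduces to a match over versions; a sketch
assuming rusqlite (the real run functions take more context than a bare
transaction):

    // Sketch: one plain function per schema version, each run within
    // its own transaction. v1/v2 bodies elided.
    mod v1 { pub fn run(_tx: &rusqlite::Transaction) -> rusqlite::Result<()> { Ok(()) } }
    mod v2 { pub fn run(_tx: &rusqlite::Transaction) -> rusqlite::Result<()> { Ok(()) } }

    fn upgrade(conn: &mut rusqlite::Connection, old: i32, new: i32)
               -> rusqlite::Result<()> {
        for v in old..new {
            let tx = conn.transaction()?;
            match v {
                1 => v1::run(&tx)?, // 1 -> 2
                2 => v2::run(&tx)?, // 2 -> 3
                _ => unreachable!(),
            }
            tx.commit()?;
        }
        Ok(())
    }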
In the case of the sample file dir metadata, it actually can happen before the
database transaction; the stuff written to the database later just needs to be
consistent with what it finds if there's an existing metadata file from a
half-completed update.
For safety, ensure there are no unexpected directory contents before
upgrading 1->2, and ensure the metadata matches before upgrading 2->3.
I want to be able to use it in etags without having to do a full scan of the
recording_playback in advance, which would greatly increase time to first
byte. I probably will even use it in urls to ensure the segments they point to
are stable. I haven't actually done this yet - it will wait until I implement
serving unflushed recordings - but I want to get the schema set up properly.
Every recording it starts must be sent to the syncer with at least one sample
written. It will try forever (unless the channel is down, in which case it
panics). This
avoids the situation in which it prevents something in the uncommitted
VecDeque from ever being synced and thus any further recordings from being
flushed.
The new approach is to, rather than panicking, retry forever. The assumption
is that if a given operation is failing, a following operation is unlikely to
succeed, so it's simpler to just keep trying the earlier one than come up with
ways to undo it and proceed with later operations.
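The loop itself is as simple as it sounds; a sketch, assuming a dedicated
syncer thread where blocking a second at a time is acceptable:

    use std::time::Duration;

    // Sketch: retry an operation until it succeeds, logging each failure.
    fn retry_forever<T, E: std::fmt::Display>(mut f: impl FnMut() -> Result<T, E>) -> T {
        loop {
            match f() {
                Ok(v) => return v,
                Err(e) => {
                    eprintln!("operation failed: {}; retrying in 1 sec", e);
                    std::thread::sleep(Duration::from_secs(1));
                }
            }
        }
    }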
I still need to apply this approach to the Writer class. It currently unwraps
(crashes) or just gives up on a recording without ever sending it to the
Syncer. Given that recordings are all synced in order, that means further ones
can never be synced.
I mistakenly thought these had to be monomorphized. (The FnOnce still
does, until rust-lang/rfcs#1909 is implemented.) Turns out this way works
fine. It should result in less compile time / code size, though I didn't check
this.
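The pattern, for reference: accept &mut dyn FnMut rather than a generic
closure parameter when the function only calls it by reference. A sketch:

    // Sketch: the generic form is monomorphized once per closure type...
    fn for_each_generic<F: FnMut(i32)>(mut f: F) {
        for i in 0..3 { f(i); }
    }

    // ...while this form is compiled once. FnOnce can't be called through
    // &mut dyn (calling it consumes it), hence the caveat above about
    // rust-lang/rfcs#1909.
    fn for_each_dyn(f: &mut dyn FnMut(i32)) {
        for i in 0..3 { f(i); }
    }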
As noted in schema.sql, this can be used for disambiguation. It also may be
useful in diagnosing data integrity problems.
Also, sneak in a couple minor improvements: better diagnostics in a couple
places, fix to 1->2 upgrade procedure.
This improves the practicality of having many streams (including the doubling
of streams by having main + sub streams for each camera). With these tuned
properly, extra streams don't cause any extra write cycles in normal or error
cases. Consider the worst case in which each RTSP session immediately sends a
single frame and then fails. Moonfire retries every second, so this would
formerly cause one commit per second per stream. (flush_if_sec=0 preserves
this behavior.) Now the commits can be arbitrarily infrequent by setting
higher values of flush_if_sec.
WARNING: this isn't production-ready! I hacked up dir.rs to make tests pass
and "moonfire-nvr run" work in the best-case scenario, but it doesn't handle
errors gracefully. I've been debating what to do when writing a recording
fails. I considered "abandoning" the recording, then either reusing or skipping
its id (in the latter case, marking the file as garbage if it can't be
unlinked immediately). I now think there's no point in abandoning a recording.
If I can't write to that file, there's no reason to believe another will work
better. It's better to retry that recording forever, and perhaps put the whole
directory into an error state that stops recording until those writes go
through. I'm planning to redesign dir.rs to make this happen.
It should reduce compile time / memory usage to put quite a bit of the code
into a separate crate. I also intend to limit visibility of some things to
only within the db crate, but that's for a future change. This is the smallest
move that will compile.
The filenames now represent composite ids (stream id + recording id; see the
sketch after this list) rather than a separate uuid system with its own
reservation, for a few benefits:
* This provides more information when there are inconsistencies.
* This avoids the need for managing the reservations during recording. I
expect this to simplify delaying flushing of newly written sample files.
Now the directory has to be scanned at startup for files that never got
written to the database, but that's acceptably fast even with millions of
files.
* Less information to keep in memory and in the recording_playback table.
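Roughly, a composite id packs both halves into a single i64; a sketch,
assuming both ids are nonnegative:

    // Sketch: one i64 usable as a filename, a db key, and a log token.
    #[derive(Copy, Clone, PartialEq, Eq)]
    struct CompositeId(i64);

    impl CompositeId {
        fn new(stream_id: i32, recording_id: i32) -> Self {
            CompositeId(((stream_id as i64) << 32) | (recording_id as i64))
        }
        fn stream(self) -> i32 { (self.0 >> 32) as i32 }
        fn recording(self) -> i32 { self.0 as i32 }
    }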
I'd considered using one directory per stream, which might help if the
filesystem has trouble coping with huge directories. But that would mean each
dir has to be fsync()ed separately (more latency and/or more multithreading).
So I'll stick with this until I see concrete evidence of a problem it would
solve.
Test coverage of the error conditions is poor. I plan to do some restructuring
of the db/dir code, hopefully making steps toward testability along the way.
The idea is to avoid the problems described in src/schema.proto; those
possibilities have bothered me for a while. A bonus is that (in a future
commit) it can replace the sample file uuid scheme in favor of using
<camera_uuid>-<stream_type>/<recording_id> for several advantages:
* on data integrity problems (specifically, extra sample files), more
information to use to understand what happened.
* no more reserving sample files prior to using them. This avoids some extra
database transactions on startup (now there's an extra two total rather
than an extra one per stream). It also simplifies an upcoming change I
want to make in which some streams are not flushed immediately, reducing
the write load significantly (maybe one per minute total rather than one
per stream per minute).
* get rid of eight bytes per playback cache entry in RAM (and nine bytes
per recording_playback row on flash).
The implementation is still pretty rough in places:
* Lack of tests.
* Poor code organization. In particular, SampleFileDirectory::write_meta
shouldn't be exposed beyond db. I'm thinking about moving db.rs and
SampleFileDirectory to a new crate, moonfire_nvr_db. This would improve
compile times as well.
* No tooling for renaming a sample file directory.
* Config subcommand still panics in conditions that can be reasonably
expected to happen.
This is still pretty basic support. There's no config UI support for
renaming/moving the sample file directories after they are created, and no
error checking that the files are still in the expected place. I can imagine
sysadmins getting into trouble trying to change things. I hope to address at
least some of that in a follow-up change to introduce a versioning/locking
scheme that ensures databases and sample file dirs match in some way.
A bonus change that kinda got pulled along for the ride: a dialog pops up in
the config UI while a stream is being tested. The experience was pretty bad
before; there was no indication the button worked at all until it was done,
sometimes many seconds later.
This avoids having codec-specific logic to synthesize it in db.rs. It's not
too much of a problem now with only H.264 support, but it'd be a pain when
supporting H.265 and other codecs.
This allows each camera to have a main and a sub stream. Previously there was
a field in the schema for the sub stream's url, but it didn't do anything. Now
you can configure individual retention for main and sub streams. They show up
grouped in the UI.
No support for upgrading from schema version 1 yet.
This is a wash in terms of lines of code now, but it makes it a bit easier to
maintain as I make changes to the schema (such as separating out streams from
cameras), and it helps ensure the tests reflect reality.
My odroid setup has been occasionally (about once a week) losing about 15
seconds of recordings on all cameras. I'm not sure why. So I'm labelling all
the likely suspect spots and logging if any of them takes longer than a
second. I think this will give me more information, hopefully narrowing it
down to network or local disk I/O.
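The labels are guard objects that log on drop when a section runs long; a
sketch (the real version would presumably use the log crate rather than
eprintln!):

    use std::time::{Duration, Instant};

    // Sketch: warn if the guarded section takes more than a second.
    struct TimerGuard<'a> {
        label: &'a str,
        start: Instant,
    }

    impl<'a> TimerGuard<'a> {
        fn new(label: &'a str) -> Self {
            TimerGuard { label, start: Instant::now() }
        }
    }

    impl<'a> Drop for TimerGuard<'a> {
        fn drop(&mut self) {
            let elapsed = self.start.elapsed();
            if elapsed > Duration::from_secs(1) {
                eprintln!("{} took {:?}!", self.label, elapsed);
            }
        }
    }

    // Usage: let _t = TimerGuard::new("fsync"); file.sync_all()?;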
The Javascript is pretty amateurish, I'm sure, but at least it's something to
iterate from. It's already much more pleasant for browsing through videos in
several ways:
* more responsive to load only a day at a time rather than 90+ days
* much easier to see the same time segment on several cameras
* more pleasant to have the videos load as a popup rather than a link
that blows away your position in an enormous list
* exposes the fancier .mp4 generation options: splitting at lengths
other than the default, trimming to an arbitrary start and end time,
including a subtitle track with timestamps.
There's a slight regression in functionality: I didn't match the former
top-level page, which showed how much each camera used of its disk allocation and
the total duration of video. This is exposed in the JSON API, so it shouldn't
be too hard to add back.
The recording::Segment was constructing a segment with no frames in it, which
was causing a panic when appending a zero-length stts to the Slices. Fix this
in a couple ways:
* Slices::append should return Err rather than panic. No reason to crash the
whole program when we have trouble serving a single .mp4 request.
* recording::Segment shouldn't produce zero-frame segments
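The first fix amounts to validating in append and returning the problem to
the caller; a sketch with the types pared way down:

    // Sketch: refuse zero-length slices instead of asserting.
    struct Slice { len: u64 }

    struct Slices { slices: Vec<Slice>, len: u64 }

    impl Slices {
        fn append(&mut self, s: Slice) -> Result<(), String> {
            if s.len == 0 {
                return Err(format!("zero-length slice after {} bytes", self.len));
            }
            self.len += s.len;
            self.slices.push(s);
            Ok(())
        }
    }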
I had an assert that fired in this case, dating back to when I hadn't plumbed
Result returns through much of .mp4 construction. Now I have, so there's no
excuse for having an assert here. Change to an error return, and tweak it to
not fire in the zero-duration case.
Also fix a problem in the test harness; I hadn't finished converting it for
multi-recording tests, and it was returning the wrong recording.
Because of that, I seem to have stumbled across a related problem in which
asking for zero duration of a non-zero duration recording will return a
recording::Segment with no frames, which will cause panics because its
corresponding .mp4 slices are zero-length. I just adjusted the panic message
here; I'll follow up with changes to address that.
* CameraDayKey::bounds (used to generate the start and end times of days in
the returned JSON) returned UTC, not matching what recordings were mapped
into that day. So fetching a day with its given bounds would return
something different. Test and fix it.
* Several time-related tests weren't calling testutil::init(), so they weren't
fixing the time zone to the expected America/Los_Angeles. If the machine
time is set to something else, they would break.
I think this is an ffmpeg bug, which I plan to report. In the meantime, this
makes the tests pass. Long-term, even if ffmpeg fixes this, I probably don't
want to continue doing acceptance tests against whatever version of ffmpeg
happens to be installed - my real targets of interest are the latest versions
of Chrome, Firefox, Safari, QuickTime, and VLC.
This was causing Firefox to fail to play multipart .mp4s which trimmed away a
prefix. In the developer console, it said NS_ERROR_DOM_MEDIA_METADATA_ERR
without giving any RESULT_DETAIL, making it a pain to diagnose. Given that the
stss is supposed to be needed for seeking, I'm surprised it didn't have any
immediately obvious impact on Chrome or VLC. Maybe they just took longer to
seek than otherwise necessary.
The bug was that when keeping track of the "next frame num" while constructing
the .mp4, I appended the number in the underlying recording, not the number
post-trimming. That meant following segments used the wrong numbers. In some
cases, it caused it to exceed the total number of samples in the generated
.mp4, which seems to be what Firefox was complaining about. Running the result
through "ffmpeg -i bad.mp4 -c copy -f mp4 good.mp4" just trimmed away the most
obviously invalid ones, leaving others that didn't point to the frames they
meant to. That was enough to make Firefox start playing the file. /shruggie
The existing tests were all with a single segment, so I added a new one to
catch this. I also added a Debug implementation to recording::Segment and
mp4::Segment.
This was totally broken in commit 1cf27c18. It would serve bytes from the
beginning of the sample file in question, not from the start of the given
range.
It should be exactly 1, but was slightly more because the fraction was
incorrectly 1 rather than 0. I'm not sure if any actual players care about
this, but it was something I noticed when looking into strange edit list
behavior.
This is intended to support HTML5 Media Source Extensions, which I expect to
be the most practical way to make a good web UI with a proper scrub bar and
such.
This feature has had very limited testing on Chrome and Firefox, and that was
not entirely successful. More work is needed before it's usable, but this
seems like a helpful progress checkpoint.
This significantly improves safety of the ffmpeg interface. The complex
ABIs aren't accessed directly from Rust. Instead, I have a small C
wrapper which uses the ffmpeg C API and the C headers at compile-time to
determine the proper ABI in the same way any C program using ffmpeg
would, so that the ABI doesn't have to be duplicated in Rust code.
I've tested with ffmpeg 2.x and ffmpeg 3.x; it seems to work properly
with both where before ffmpeg 3.x caused segfaults.
It still depends on ABI compatibility between the compiled and running
versions. C programs need this, too, and normal shared library
versioning practices provide this guarantee. But log both versions on
startup for diagnosing problems with this.
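A hedged sketch of the Rust side of that logging (the wrapper function name
is made up; avutil_version() is ffmpeg's real runtime query):

    // Sketch: the C wrapper exposes LIBAVUTIL_VERSION_INT as seen at
    // compile time; avutil_version() reports the loaded library's value.
    extern crate libc;

    extern "C" {
        fn moonfire_ffmpeg_compiled_libavutil_version() -> libc::c_int;
        fn avutil_version() -> libc::c_uint;
    }

    fn log_versions() {
        unsafe {
            println!("libavutil: compiled={:x} running={:x}",
                     moonfire_ffmpeg_compiled_libavutil_version(),
                     avutil_version());
        }
    }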
Fixes #7
The benchmarks don't get compiled with the standard "cargo test";
they require "cargo +nightly bench --features=nightly", so I didn't notice
they were broken in the previous commit. Now fixed.
This makes a huge difference in the reported time - 863 usec rather than 6
milliseconds on my laptop. Part of the difference is in reqwest client setup
(it apparently initializes an SSL_CTX that is never used in this test), part
fresh connections vs keepalive, part I don't know what. None of it seems
relevant to the logic I want to test.
serve_generated_bytes is >3X faster. One caveat is that the reactor thread may
stall when reading from the memory-mapped slice. Moonfire NVR is basically a
single-user program, so that may not be so bad, but we'll see.
It had an Arc which in hindsight isn't necessary; the actual video index
generation is fast anyway. This saves a couple pointers per cache entry and
the overhead of chasing them. LruCache itself also has some extra pointers on
it but that's something to address another day.
This reduces the working set by another 960 bytes for a typical one-hour recording, improving cache efficiency a bit more.
8 bytes from SampleIndexIterator:
* reduce the three "bytes" fields to two. Doing so as "bytes_key" vs
"bytes_nonkey" slowed it down a bit, perhaps because the "bytes" is
needed right away and requires a branch. But "bytes" vs "bytes_other"
seems fine. Looks like it can do this with cmovs in parallel with other
stuff.
* stuff "is_key" into the "i" field.
8 bytes from recording::Segment itself:
* make "frames" and "key_frame" u16s
* stuff "trailing_zero" into "video_sample_entry_id"
There were Nagle's algorithm delays in both the "fresh_client" and
"reuse_client" versions of the .mp4 serving benchmark. Now performance is much
more consistent.
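The cure, for reference, is the standard one: disable Nagle's algorithm on
the benchmark connections (a sketch; I'm assuming TcpStream::set_nodelay
here):

    use std::net::TcpStream;

    // Sketch: disable Nagle's algorithm so small request/response
    // round trips aren't held back waiting to coalesce writes.
    fn connect(addr: &str) -> std::io::Result<TcpStream> {
        let stream = TcpStream::connect(addr)?;
        stream.set_nodelay(true)?;
        Ok(stream)
    }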
* don't store sizes of mp4-format sample indexes; recalculate them.
* keep SampleIndexIterator position as a u32 rather than a usize.
This is 960 bytes for a 60-minute mp4; another small cache usage improvement.
For a one-hour recording, this is about 2 KiB, so a decent chunk of a
Raspberry Pi 2's L1 cache. Generating the Slices and searching/scanning
it should be a bit faster and pollute the cache less.
This is a pretty small optimization now when transferring a decent chunk
of the moov or mdat, but it's easy enough to do. It will be much more
noticeable if I end up interleaving the captions between each key frame.
This came up when I tried using the "bundled" feature of rusqlite. Its build
script passes -DSQLITE_DEFAULT_FOREIGN_KEYS=1, which caused a test to fail.
Fix the bug that this option revealed, and set the pragma so we'll catch
such problems in the future even when using a system library not compiled in
this way.
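Setting the pragma per connection is a one-liner with rusqlite; a sketch:

    // Sketch: enforce foreign keys regardless of how SQLite was
    // compiled. Note the pragma is a no-op inside a transaction, so it
    // must be set on the fresh connection.
    fn open(path: &str) -> rusqlite::Result<rusqlite::Connection> {
        let conn = rusqlite::Connection::open(path)?;
        conn.execute_batch("pragma foreign_keys = on;")?;
        Ok(conn)
    }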
This was completely wrong: it overflowed on large filesystems and
double-counted the used bytes.
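A classic way such a calculation overflows is multiplying block size by block
count in a 32-bit type; widening first avoids it. A sketch, assuming
statvfs(2) results via the libc crate:

    extern crate libc;

    // Sketch: compute filesystem capacity without 32-bit overflow.
    fn capacity_bytes(fs: &libc::statvfs) -> i64 {
        (fs.f_frsize as i64) * (fs.f_blocks as i64)
    }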
The new logic is still imperfect in that if there are a bunch of files in the
process of being deleted (moved from recording to reserved_sample_files but
not yet unlinked), they'll be taken out of the total capacity. Maybe it should
stat everything in the sample file directory instead of relying on the
recording table. It's definitely an improvement, though.
This page was noticeably slower than necessary because the recording_cover
index wasn't actually covering the query. Both the schema for new databases
and the upgrade query were broken (and not even in the same way).
No new schema version to correct this, at least for now. I'll probably have
another reason to change the schema soon anyway and can throw this in.
This makes it easier to understand which options are valid with each
command.
Additionally, there's more separation of implementations. The most
obvious consequence is that "moonfire-nvr ts ..." no longer uselessly
locks/opens a database.
* add a "ts" subcommand to convert between numeric and human-readable
representations. This is handy when directly inspecting the SQLite database
or API output.
* also take the human-readable form in the web interface's camera view.
* to reduce confusion, when using trim=true on the web interface's camera
view, trim the displayed starting and ending times as well as the actual
.mp4 file links.
These are currently the only thing which require a nightly Rust. I haven't run
them since adding the feature gates. The feature gates were slightly broken,
and the actual benchmarks had bitrotted a bit. Fix these things. Also put them
into a separate submodule from the regular tests, so that not as many
feature gates (#[cfg(feature="nightly")]) are required.