mirror of https://github.com/scottlamb/moonfire-nvr.git, synced 2025-10-29 15:55:01 -04:00

update design docs for new-schema branch changes

parent 91636d3193
commit 65e68d3255

				| @ -32,21 +32,8 @@ syntax = "proto3"; | ||||
| 
 | ||||
| // Metadata stored in sample file dirs as "<dir>/meta". This is checked | ||||
| // against the metadata stored within the database to detect inconsistencies | ||||
| // between the directory and database, including the following: | ||||
| // | ||||
| // * sample file directory's disk not being mounted. | ||||
| // * mixing up mount points of two sample file directories belonging to the | ||||
| //   same database. | ||||
| // * directory renames not properly recorded in the database. | ||||
| // * restoration of the database from backup but not the sample file | ||||
| //   directory. | ||||
| // * restoration of the sample file directory but not the database. | ||||
| // * two sample file directory paths pointed at the same inode via symlinks | ||||
| //   or non-canonical paths. (Note that flock(2) has a design flaw in which | ||||
| //   multiple file descriptors can share a lock, so the current locking scheme | ||||
| //   is not sufficient to detect this otherwise.) | ||||
| // * database and sample file directories forked from the same version, opened | ||||
| //   the same number of times, then crossed. | ||||
| // between the directory and database, such as those described in | ||||
| // design/schema.md. | ||||
| message DirMeta { | ||||
|   // A uuid associated with the database, in binary form. dir_uuid is strictly | ||||
|   // more powerful, but it improves diagnostics to know if the directory | ||||
|  | ||||
| @ -1,7 +1,6 @@ | ||||
| # Moonfire NVR API | ||||
| 
 | ||||
| Status: **unstable**. This is an early draft; the API may change without | ||||
| warning. | ||||
| Status: **current**. | ||||
| 
 | ||||
| ## Objective | ||||
| 
 | ||||
|  | ||||
							
								
								
									
										222
									
								
								design/schema.md
									
									
									
									
									
								
							
							
						
						
									
							| @ -1,7 +1,6 @@ | ||||
| # Moonfire NVR Storage Schema | ||||
| 
 | ||||
| Status: **current**. This is largely implemented; there is optimization and | ||||
| testing work left to do. | ||||
| Status: **current**. | ||||
| 
 | ||||
| This is the initial design for the most fundamental parts of the Moonfire NVR | ||||
| storage schema. See also [guide/schema.md](../guide/schema.md) for more | ||||
| @ -128,7 +127,7 @@ together. | ||||
| 
 | ||||
| Each recording is stored in two places: | ||||
| 
 | ||||
| * the recording samples directory, intended to be stored on spinning disk. | ||||
| * a sample file directory, intended to be stored on spinning disk. | ||||
|   Each file in this directory is simply a concatenation of the compressed, | ||||
|   timestamped video samples (also called "packets" or encoded frames), as | ||||
|   received from the camera. In MPEG-4 terminology (see [ISO | ||||
| @ -225,74 +224,213 @@ The design avoids the need for the following constraints: | ||||
| * Serving close to live. It's possible to serve a recording as it is being | ||||
|   written. | ||||
| 
 | ||||
| ### Lifecycle of a sample file directory | ||||
| 
 | ||||
| One major disadvantage to splitting the state in two (the SQLite3 database in | ||||
| flash and the sample file directories on spinning disk) is the possibility of | ||||
| inconsistency. There are many ways this could arise: | ||||
| 
 | ||||
| * a sample file directory's disk is unexpectedly not mounted due to hardware | ||||
|   failure or misconfiguration. | ||||
| * the administrator mixing up the mount points of two filesystems holding | ||||
|   different sample file directories. | ||||
| * the administrator renaming a sample file directory without updating the database. | ||||
| * the administrator restoring the database from backup but not the sample file | ||||
|   directory, or vice versa. | ||||
| * the administrator providing two sample file directory paths pointed at the | ||||
|   same inode via symlinks or non-canonical paths. (Note that flock(2) has a | ||||
|   design flaw in which multiple file descriptors can share a lock, so the current | ||||
|   locking scheme is not sufficient to detect this otherwise.) | ||||
| * database and sample file directories forked from the same version, opened | ||||
|   the same number of times, then crossed. | ||||
| 
 | ||||
| To combat this, each sample file directory has metadata stored both in its | ||||
| database row and in a file called `meta`. These track uuids associated with | ||||
| the database and directory to avoid mixups. They also track sequence numbers | ||||
| and uuids associated with "opens": each time the database has been opened in | ||||
| read/write mode. | ||||
| 
 | ||||
| ```sql | ||||
| create table open ( | ||||
|   id integer primary key, | ||||
|   uuid blob unique not null check (length(uuid) = 16) | ||||
| ); | ||||
| 
 | ||||
| create table sample_file_dir ( | ||||
|   id integer primary key, | ||||
|   path text unique not null, | ||||
|   uuid blob unique not null check (length(uuid) = 16), | ||||
| 
 | ||||
|   -- The last (read/write) open of this directory which fully completed. | ||||
|   -- See schema.proto:DirMeta for a more complete description. | ||||
|   last_complete_open_id integer references open (id) | ||||
| ); | ||||
| ``` | ||||
| 
 | ||||
| ```proto | ||||
| message DirMeta { | ||||
|   // A uuid associated with the database, in binary form. dir_uuid is strictly | ||||
|   // more powerful, but it improves diagnostics to know if the directory | ||||
|   // belongs to the expected database at all or not. | ||||
|   bytes db_uuid = 1; | ||||
| 
 | ||||
|   // A uuid associated with the directory itself. | ||||
|   bytes dir_uuid = 2; | ||||
| 
 | ||||
|   // Corresponds to an entry in the `open` database table. | ||||
|   message Open { | ||||
|     uint32 id = 1; | ||||
|     bytes uuid = 2; | ||||
|   } | ||||
| 
 | ||||
|   // The last open that was known to be recorded in the database as completed. | ||||
|   // Absent if this has never happened. Note this can backtrack in exactly one | ||||
|   // scenario: when deleting the directory, after all associated files have | ||||
|   // been deleted, last_complete_open can be moved to in_progress_open. | ||||
|   Open last_complete_open = 3; | ||||
| 
 | ||||
|   // The last run which is in progress, if different from last_complete_open. | ||||
|   // This may or may not have been recorded in the database, but it's | ||||
|   // guaranteed that no data has yet been written by this open. | ||||
|   Open in_progress_open = 4; | ||||
| } | ||||
| ``` | ||||
| 
 | ||||
| These are updated through the procedures below: | ||||
| 
 | ||||
| *Write the metadata file* | ||||
| 
 | ||||
| This is a sub-procedure used in several places below. | ||||
| 
 | ||||
| Precondition: the directory's lock is held with `LOCK_EX` (exclusive). | ||||
| 
 | ||||
|   1. Write a new `meta.tmp` (opened with `O_CREAT|O_TRUNC` to discard an | ||||
|      existing temporary file if any). | ||||
|   2. `fsync` the `meta.tmp` file descriptor. | ||||
|   3. `rename` `meta.tmp` to `meta`. | ||||
|   4. `fsync` the directory. | ||||
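The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Moonfire NVR's actual implementation: the function name `write_meta` is hypothetical, the payload is treated as opaque bytes (in practice it would be a serialized `DirMeta`), and the caller is assumed to already hold the directory's exclusive lock.

```python
import os


def write_meta(dir_path: str, payload: bytes) -> None:
    """Atomically replace <dir_path>/meta with payload (hypothetical helper).

    Assumes the caller already holds the directory lock with LOCK_EX.
    """
    tmp_path = os.path.join(dir_path, "meta.tmp")
    meta_path = os.path.join(dir_path, "meta")

    # Step 1: O_CREAT|O_TRUNC discards any stale temporary file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Step 2: make the file contents durable before the rename.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 3: atomically swap the new file into place.
    os.rename(tmp_path, meta_path)

    # Step 4: make the rename itself durable.
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because the rename is atomic, a reader sees either the old `meta` or the new one, never a partially written file.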
| 
 | ||||
| *Open the database as read-only* | ||||
| 
 | ||||
|   1. Lock the database directory with `LOCK_SH` (shared). | ||||
|   2. Open the SQLite database with `SQLITE_OPEN_READ_ONLY`. | ||||
| 
 | ||||
| *Open the database as read-write* | ||||
| 
 | ||||
|   1. Lock the database directory with `LOCK_EX` (exclusive). | ||||
|   2. Open the SQLite database with `SQLITE_OPEN_READ_WRITE`. | ||||
|   3. Insert a new `open` table row with the new sequence number and uuid. | ||||
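A rough Python sketch of the read-write open follows. Several details are assumptions made for illustration: the database file name `db`, the function name `open_db_read_write`, the use of a non-blocking lock, and letting SQLite assign the next `open.id` as the sequence number. SQLite's URI parameter `mode=rw` plays the role of `SQLITE_OPEN_READ_WRITE` here.

```python
import fcntl
import os
import sqlite3
import uuid


def open_db_read_write(db_dir: str):
    """Lock db_dir exclusively, open <db_dir>/db read-write, record an open.

    Returns (dir_fd, conn, open_id, open_uuid). The lock is held for as long
    as dir_fd stays open.
    """
    # Step 1: take the exclusive lock on the database directory.
    # LOCK_NB (an assumption) fails immediately rather than waiting.
    dir_fd = os.open(db_dir, os.O_RDONLY)
    try:
        fcntl.flock(dir_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # Step 2: mode=rw opens an existing database without creating one.
        db_path = os.path.join(db_dir, "db")
        conn = sqlite3.connect(f"file:{db_path}?mode=rw", uri=True)
    except Exception:
        os.close(dir_fd)
        raise

    # Step 3: insert the new `open` row; the id is the sequence number.
    open_uuid = uuid.uuid4().bytes
    with conn:
        cur = conn.execute("insert into open (uuid) values (?)", (open_uuid,))
    return dir_fd, conn, cur.lastrowid, open_uuid
```

Closing `dir_fd` releases the lock, so the caller would keep it open until shutdown.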
| 
 | ||||
| *Create a sample file directory* | ||||
| 
 | ||||
| Precondition: database open read-write. | ||||
| 
 | ||||
|   1. Lock the sample file directory with `LOCK_EX` (exclusive). | ||||
|   2. Verify there is no metadata file or `last_complete_open` is unset. | ||||
|   3. Write a new metadata file with a fresh `dir_uuid` and an `in_progress_open` | ||||
|      matching the database's current open. | ||||
|   4. Add a matching row to the database with `last_complete_open_id` matching | ||||
|      the current open. | ||||
|   5. Update the metadata file to move `in_progress_open` to | ||||
|      `last_complete_open`. | ||||
| 
 | ||||
| *Open a sample file directory read-only* | ||||
| 
 | ||||
| Precondition: database open (read-only or read-write). | ||||
| 
 | ||||
|   1. Lock the sample file directory with `LOCK_SH` (shared). | ||||
|   2. Verify the metadata file matches the database: | ||||
|         * database uuid matches. | ||||
|         * dir uuid matches. | ||||
|         * if the database's `last_complete_open` is set, it must match the | ||||
|           directory's `last_complete_open` or `in_progress_open`. | ||||
|         * if the database's `last_complete_open` is absent, the directory's | ||||
|           must be as well. | ||||
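The verification rules above can be expressed as a small pure function. This is a sketch only: the `DirMeta` dataclass mirrors the proto message for illustration, an open is modeled as an `(id, uuid)` tuple, and the name `meta_matches_db` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Open = Tuple[int, bytes]  # (id, uuid), mirroring DirMeta.Open


@dataclass
class DirMeta:
    db_uuid: bytes
    dir_uuid: bytes
    last_complete_open: Optional[Open] = None
    in_progress_open: Optional[Open] = None


def meta_matches_db(meta: DirMeta, db_uuid: bytes, dir_uuid: bytes,
                    db_last_complete_open: Optional[Open]) -> bool:
    """Return True if the directory's metadata file matches the database row."""
    # The database uuid and the directory uuid must both match.
    if meta.db_uuid != db_uuid or meta.dir_uuid != dir_uuid:
        return False
    # If the database has no completed open, the directory must have none.
    if db_last_complete_open is None:
        return meta.last_complete_open is None
    # Otherwise the database's open must match one of the directory's two.
    return db_last_complete_open in (meta.last_complete_open,
                                     meta.in_progress_open)
```

Accepting a match against `in_progress_open` covers a crash between the database update and the final metadata rewrite in the read-write open procedure.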
| 
 | ||||
| *Open a sample file directory read-write* | ||||
| 
 | ||||
| Precondition: database open read-write. | ||||
| 
 | ||||
|   1. Lock the sample file directory with `LOCK_EX` (exclusive). | ||||
|   2. Verify the metadata file matches the database (as above). | ||||
|   3. Update the metadata file with `in_progress_open` matching the current | ||||
|      open. | ||||
|   4. Update the database row with `last_complete_open_id` matching the current | ||||
|      open. | ||||
|   5. Update the metadata file with `last_complete_open` rather than | ||||
|      `in_progress_open`. | ||||
|   6. Run the recording startup procedure for this directory. | ||||
| 
 | ||||
| *Close a sample file directory* | ||||
| 
 | ||||
|   1. Drop the sample file directory lock. | ||||
| 
 | ||||
| *Delete a sample file directory* | ||||
| 
 | ||||
|   1. Remove all sample files (of all three categories described below: | ||||
|      `recording` table rows, `garbage` table rows, and files with recording | ||||
|      ids >= their stream's `next_recording_id`); see "delete a recording" | ||||
|      procedure below. | ||||
|   2. Rewrite the directory metadata with `in_progress_open` set to the current open, | ||||
|      `last_complete_open` cleared. | ||||
|   3. Delete the directory's row from the database. | ||||
| 
 | ||||
| ### Lifecycle of a recording | ||||
| 
 | ||||
| Because a major part of the recording state is outside the SQL database, care | ||||
| must be taken to guarantee consistency and durability. Moonfire NVR maintains | ||||
| three invariants about sample files: | ||||
| 
 | ||||
| 1. `recording` table rows have sample files on disk | ||||
|    (named by the given UUID) with the indicated size and SHA-1 hash. | ||||
| 2. There are no sample files without a corresponding `recording` or | ||||
|    `reserved_sample_files` table row referencing their UUID. | ||||
| 3. After an orderly shutdown of Moonfire NVR, there are no | ||||
|    `reserved_sample_files` rows, even if there have been previous crashes. | ||||
|   1. `recording` table rows have sample files on disk with the indicated size | ||||
|      and SHA-1 hash. | ||||
|   2. Exactly one of the following statements is true for every sample file: | ||||
|         * It has a `recording` table row. | ||||
|         * It has a `garbage` table row. | ||||
|         * Its recording id is greater than or equal to the `next_recording_id` | ||||
|           for its stream. | ||||
|   3. After an orderly shutdown of Moonfire NVR, there is a `recording` table row | ||||
|      for every sample file, even if there have been previous crashes. | ||||
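The "exactly one" rule in the new second invariant can be checked mechanically. The sketch below is illustrative only: the function name, the state labels, and the choice to pass one stream's recording ids as plain sets are all assumptions.

```python
def classify_sample_file(recording_id: int, recording_ids: set,
                         garbage_ids: set, next_recording_id: int) -> str:
    """Classify one sample file of a single stream under invariant 2.

    Raises ValueError unless exactly one of the three statements holds.
    """
    states = []
    if recording_id in recording_ids:
        states.append("recording")
    if recording_id in garbage_ids:
        states.append("garbage")
    if recording_id >= next_recording_id:
        states.append("uncommitted")
    if len(states) != 1:
        raise ValueError(
            f"invariant 2 violated for recording id {recording_id}: {states}")
    return states[0]
```

A file for which none of the statements holds is unexplained by the database, which is the situation the invariant is meant to make detectable.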
| 
 | ||||
| The first invariant provides certainty that a recording is properly stored. It | ||||
| would be prohibitively expensive to verify hashes on demand (when listing or | ||||
| serving recordings), or in some cases even to verify the size of the files via | ||||
| `stat()` calls. | ||||
| 
 | ||||
| The second invariant avoids an accidental data loss scenario. On startup, as | ||||
| part of normal crash recovery, Moonfire NVR should delete sample files which are | ||||
| half-written (and useless without their indices) and ones which were already in | ||||
| the process of being deleted (for exceeding their retention time). The absence | ||||
| of a `recording` table row could be taken to indicate one of these conditions. | ||||
| But consider another possibility: the SQLite database might not match the sample | ||||
| directory. This could happen if the wrong disk is mounted at a given path or | ||||
| after a botched restore from backup. Moonfire NVR would delete everything in | ||||
| this case! It's far safer to require a specific mention of each file to be | ||||
| deleted, requiring human intervention before touching unexpected files. | ||||
| The second invariant improves auditability of the database and sample file | ||||
| directory. | ||||
| 
 | ||||
| The third invariant prevents accumulation of garbage files which could fill the | ||||
| drive and stop recording. | ||||
| 
 | ||||
| Sample files are named by UUID. Imagine if files were named by autoincrement | ||||
| instead. One file could be mistaken for another on database vs directory | ||||
| mismatch. With UUIDs, this is impossible: by design they can be assumed to be | ||||
| universally unique, so two distinct recordings will never share a UUID. | ||||
| 
 | ||||
| These invariants are maintained through the following procedures: | ||||
| 
 | ||||
| *Create a recording:* | ||||
| 
 | ||||
| 1. Insert a `reserved_sample_files` row, in state `WRITING`. | ||||
| 2. Write the sample file, aborting if `open(..., O_WRONLY|O_CREAT|O_EXCL)` | ||||
|    fails with `EEXIST`. (This would indicate a non-unique UUID, a serious | ||||
|    defect.) | ||||
| 1. Write the sample file, aborting if `open(..., O_WRONLY|O_CREAT|O_EXCL)` | ||||
|    fails with `EEXIST`. | ||||
| 3. `fsync()` the sample file. | ||||
| 4. `fsync()` the sample file directory. | ||||
| 5. Replace the `reserved_sample_files` row with a `recording` row, | ||||
|    marking its size and SHA-1 hash in the process. | ||||
| 5. Insert the `recording` row, marking its size and SHA-1 hash in the process. | ||||
| 
 | ||||
| *Delete a recording:* | ||||
| 
 | ||||
| 1. Replace the `recording` row with a `reserved_sample_files` row in state | ||||
|    `DELETED`. | ||||
| 1. Replace the `recording` row with a `garbage` row. | ||||
| 2. `unlink()` the sample file, warning on `ENOENT`. (This would indicate | ||||
|    invariant #2 is false.) | ||||
| 3. `fsync()` the sample file directory. | ||||
| 4. Delete the `reserved_sample_files` row. | ||||
| 4. Delete the `garbage` row. | ||||
| 
 | ||||
| *Startup (crash recovery):* | ||||
| 
 | ||||
| 1. Acquire a lock to guarantee this is the only Moonfire NVR process running | ||||
|    against the given database. This lock is not released until program shutdown. | ||||
| 2. Query `reserved_sample_files` table. | ||||
| 3. `unlink()` all the sample files associated with rows returned by #2, | ||||
|    ignoring `ENOENT`. | ||||
| 2. Query `garbage` table and `next_recording_id` field in the `stream` table. | ||||
| 3. `unlink()` all the sample files associated with garbage rows, ignoring | ||||
|    `ENOENT`. | ||||
| 4. For each stream, `unlink()` all the existing files with recording ids >= | ||||
|    `next_recording_id`. | ||||
| 4. `fsync()` the samples directory. | ||||
| 5. Delete the rows returned by #2 from the `reserved_sample_files` table. | ||||
| 5. Delete all rows from the `garbage` table. | ||||
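The file-system portion of the new startup procedure might look like the sketch below. It makes assumptions that the text above does not state: sample files are given a hypothetical name of the form `<stream id>-<recording id>`, the `garbage` rows and per-stream `next_recording_id` values have already been queried and are passed in as a set and a dict, and acquiring the lock and deleting the `garbage` rows afterwards are left to the caller.

```python
import os
import re

# Hypothetical file naming scheme: "<stream id>-<recording id>".
SAMPLE_NAME = re.compile(r"(\d+)-(\d+)")


def recover_sample_dir(dir_path: str, garbage: set,
                       next_recording_id: dict) -> None:
    """Unlink garbage and uncommitted sample files, then fsync the directory."""
    # Unlink files named by garbage rows, ignoring ENOENT.
    for stream_id, recording_id in garbage:
        try:
            os.unlink(os.path.join(dir_path, f"{stream_id}-{recording_id}"))
        except FileNotFoundError:
            pass

    # Unlink files whose recording id is >= the stream's next_recording_id.
    for name in os.listdir(dir_path):
        m = SAMPLE_NAME.fullmatch(name)
        if m is None:
            continue  # not a sample file (for example, `meta`)
        stream_id, recording_id = int(m.group(1)), int(m.group(2))
        if (stream_id in next_recording_id
                and recording_id >= next_recording_id[stream_id]):
            os.unlink(os.path.join(dir_path, name))

    # Make the unlinks durable before the garbage rows are deleted.
    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Only after this function returns would the caller delete the `garbage` rows, so a crash partway through simply repeats the same work on the next startup.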
| 
 | ||||
| The procedures can be batched: while for a given recording, the steps must be | ||||
| strictly ordered, multiple recordings can be proceeding through the steps | ||||
| @ -300,15 +438,6 @@ simultaneously. In particular, there is no need to hurry syncing deletions to | ||||
| disk, so deletion steps #3 and #4 can be done opportunistically if it's | ||||
| desirable to avoid extra disk seeks or flash write cycles. | ||||
| 
 | ||||
| There could be another procedure for moving a sample file from one filesystem | ||||
| to another. This might be used when splitting cameras across hard drives. | ||||
| New states could be introduced indicating that a recording "is moving from | ||||
| A to B" (thus, A is complete, and B is in an undefined state) or "has just | ||||
| moved from A to B" (thus, B is complete, and A may be present or not). | ||||
| Alternatively, a camera might have a search path specified for its recordings, | ||||
| such that the first directory in which a recording is found must have a | ||||
| complete copy (and subsequent directories' copies may be partial/corrupt). | ||||
| 
 | ||||
| It'd also be possible to conserve some partial recordings. Moonfire NVR could, | ||||
| as a recording is written, record the latest sample tables, | ||||
| size, and hash fields without marking the recording as fully written. On | ||||
| @ -372,6 +501,7 @@ The snippet below is an illustrative excerpt of the SQLite schema; see | ||||
|     -- A single, typically 60-second, recorded segment of video. | ||||
|     create table recording ( | ||||
|       id integer primary key, | ||||
|       open_id integer references open (id), | ||||
|       camera_id integer references camera (id) not null, | ||||
| 
 | ||||
|       sample_file_uuid blob unique not null, | ||||
|  | ||||
| @ -1,6 +1,6 @@ | ||||
| # Moonfire NVR Time Handling | ||||
| 
 | ||||
| Status: **implemented** | ||||
| Status: **current** | ||||
| 
 | ||||
| > A man with a watch knows what time it is. A man with two watches is never | ||||
| > sure. | ||||
|  | ||||