338 Commits

Author SHA1 Message Date
Juanjo Presa
c97d0ff23d Fix fatal error on missing config file by handling viper.ConfigFileNotFoundError
Correctly identify Viper's ConfigFileNotFoundError in LoadConfig to log a warning and use defaults, unifying behavior with empty config files. Fixes fatal error when no config file is present for CLI commands relying on environment variables.
2025-10-19 15:29:47 +02:00
Florian Preinstorfer
46477b8021 Downgrade completed broadcast message to debug 2025-10-18 07:56:59 +02:00
Stavros Kois
c07cc491bf
add health command (#2659)
* add health command
* update health check implementation to allow for more checks to added over time
* add change changelog entry
2025-10-16 12:00:11 +00:00
Vitalij Dovhanyc
c2a58a304d
feat: add autogroup:self (#2789) 2025-10-16 12:59:52 +02:00
Kristoffer Dalby
fddc7117e4
stability and race conditions in auth and node store (#2781)
This PR addresses some consistency issues that was introduced or discovered with the nodestore.

nodestore:
Now returns the node that is being put or updated when it is finished. This closes a race condition where when we read it back, we do not necessarily get the node with the given change and it ensures we get all the other updates from that batch write.

auth:
Authentication paths have been unified and simplified. It removes a lot of bad branches and ensures we only do the minimal work.
A comprehensive auth test set has been created so we do not have to run integration tests to validate auth and it has allowed us to generate test cases for all the branches we currently know of.

integration:
added a lot more tooling and checks to validate that nodes reach the expected state when they come up and down. Standardised between the different auth models. A lot of this is to support or detect issues in the changes to nodestore (races) and auth (inconsistencies after login and reaching correct state)

This PR was assisted, particularly tests, by claude code.
2025-10-16 12:17:43 +02:00
Andrey Bobelev
c4a8c038cd fix: return valid AuthUrl in followup request on expired reg id
- tailscale client gets a new AuthUrl and sets entry in the regcache
- regcache entry expires
- client doesn't know about that
- client always polls followup request а gets error

When user clicks "Login" in the app (after cache expiry), they visit
invalid URL and get "node not found in registration cache". Some clients
on Windows for e.g. can't get a new AuthUrl without restart the app.

To fix that we can issue a new reg id and return user a new valid
AuthUrl.

RegisterNode is refactored to be created with NewRegisterNode() to
autocreate channel and other stuff.
2025-10-11 05:57:39 +02:00
Andrey Bobelev
022098fe4e chore: make reg cache expiry tunable
Mostly for the tests, opts:

- tuning.register_cache_expiration
- tuning.register_cache_cleanup
2025-10-11 05:57:39 +02:00
Kristoffer Dalby
ed3a9c8d6d
mapper: send change instead of full update (#2775) 2025-09-17 14:23:21 +02:00
Kristoffer Dalby
2b30a15a68
cmd: add option to get and set policy directly from database (#2765) 2025-09-12 16:55:15 +02:00
Kristoffer Dalby
2938d03878
policy: reject unsupported fields (#2764) 2025-09-12 14:47:56 +02:00
Kristoffer Dalby
1b1c989268
{policy, node}: allow return paths in route reduction (#2767) 2025-09-12 11:47:51 +02:00
Kristoffer Dalby
3950f8f171
cli: use gobuild version handling (#2770) 2025-09-12 11:47:31 +02:00
Kristoffer Dalby
ee0ef396a2
policy: fix ssh usermap, fixing autogroup:nonroot (#2768) 2025-09-12 09:12:30 +02:00
Kristoffer Dalby
7056fbb63b
derp: fix flaky shuffle test (#2772) 2025-09-11 13:49:02 +00:00
Kristoffer Dalby
c91b9fc761
poll: add missing godoc (#2763) 2025-09-11 14:15:19 +02:00
Kristoffer Dalby
d41fb4d540 app: fix sigint hanging
When the node notifier was replaced with batcher, we removed
its closing, but forgot to add the batchers so it was never
stopping node connections and waiting forever.

Fixes #2751

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-11 11:53:26 +02:00
Kristoffer Dalby
01c1f6f82a
policy: validate error message for asterix in ssh (#2766) 2025-09-10 18:41:43 +02:00
Kristoffer Dalby
476f30ab20 state: ensure netinfo is preserved and not removed
the client will send a lot of fields as `nil` if they have
not changed. NetInfo, which is inside Hostinfo, is one of those
fields and we often would override the whole hostinfo meaning that
we would remove netinfo if it hadnt changed.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
233dffc186 lint and leftover
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
81b3e8f743 util: harden parsing of traceroute
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
50ed24847b debug: add json and improve
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
3b16b75fe6 integration: rework retry for waiting for node sync
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
9d236571f4 state/nodestore: in memory representation of nodes
Initial work on a nodestore which stores all of the nodes
and their relations in memory with relationship for peers
precalculated.

It is a copy-on-write structure, replacing the "snapshot"
when a change to the structure occurs. It is optimised for reads,
and while batches are not fast, they are grouped together
to do less of the expensive peer calculation if there are many
changes rapidly.

Writes will block until commited, while reads are never
blocked.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
38be30b6d4 derp: allow override to ip for debug
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
b6d5788231 mapper: produce map before poll
Before this patch, we would send a message to each "node stream"
that there is an update that needs to be turned into a mapresponse
and sent to a node.

Producing the mapresponse is a "costly" afair which means that while
a node was producing one, it might start blocking and creating full
queues from the poller and all the way up to where updates where sent.

This could cause updates to time out and being dropped as a bad node
going away or spending too time processing would cause all the other
nodes to not get any updates.

In addition, it contributed to "uncontrolled parallel processing" by
potentially doing too many expensive operations at the same time:

Each node stream is essentially a channel, meaning that if you have 30
nodes, we will try to process 30 map requests at the same time. If you
have 8 cpu cores, that will saturate all the cores immediately and cause
a lot of wasted switching between the processing.

Now, all the maps are processed by workers in the mapper, and the number
of workers are controlable. These would now be recommended to be a bit
less than number of CPU cores, allowing us to process them as fast as we
can, and then send them to the poll.

When the poll recieved the map, it is only responsible for taking it and
sending it to the node.

This might not directly improve the performance of Headscale, but it will
likely make the performance a lot more consistent. And I would argue the
design is a lot easier to reason about.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-09-09 09:40:00 +02:00
Kristoffer Dalby
8e25f7f9dd
bunch of qol (#2748) 2025-08-27 17:09:13 +02:00
cuiweixie
a2a6d20218 Refactor to use reflect.TypeFor 2025-08-23 20:43:49 +02:00
Andrey Bobelev
d29feaef79 chore(derp): allow nil regions in DERPMaps
Previously, nil regions were not properly handled. This change allows users to disable regions in DERPMaps.

Particularly useful to disable some official regions.
2025-08-23 06:54:14 +02:00
Andrey Bobelev
630bfd265a chore(derp): prioritize loading DERP maps from URLs
This allows users to override default entries provided via URL
2025-08-23 06:54:14 +02:00
Kristoffer Dalby
b87567628a
derp: increase update frequency and harden on failures (#2741) 2025-08-22 10:40:38 +02:00
Florian Preinstorfer
be337c6a33 Enable derp.server.verify_clients by default
This setting is already enabled in example-config.yaml but would default
to false if no key is set.
2025-08-19 11:30:44 +02:00
Shourya Gautam
086fcad7d9
Fix Internal server error on /verify (#2735)
* converted the returned error to an httpError
2025-08-18 14:39:42 +00:00
afranco
43f90d205e fix: allow all traffic if acls field is omited from the policy 2025-08-18 16:13:14 +02:00
Florian Preinstorfer
30a1f7e68e Log registrationID to simplify interactive node registration
Some clients such as Android make it hard to transfer the registrationID
to the server, its easier to get it from the server logs.
2025-08-15 17:11:38 +02:00
Fredrik Ekre
5d8a2c25ea OIDC: Query userinfo endpoint before verifying user
This patch includes some changes to the OIDC integration in particular:
 - Make sure that userinfo claims are queried *before* comparing the
   user with the configured allowed groups, email and email domain.
 - Update user with group claim from the userinfo endpoint which is
   required for allowed groups to work correctly. This is essentially a
   continuation of #2545.
 - Let userinfo claims take precedence over id token claims.

With these changes I have verified that Headscale works as expected
together with Authelia without the documented escape hatch [0], i.e.
everything works even if the id token only contain the iss and sub
claims.

[0]: https://www.authelia.com/integration/openid-connect/headscale/#configuration-escape-hatch
2025-08-11 17:51:16 +02:00
eyjhb
d77874373d feat: add robots.txt 2025-08-10 10:57:45 +02:00
Kristoffer Dalby
a058bf3cd3
mapper: produce map before poll (#2628) 2025-07-28 11:15:53 +02:00
Kian-Meng Ang
3123d5286b Fix typos
Found via `codespell -L shs,hastable,userr`
2025-07-21 12:06:07 +02:00
Kristoffer Dalby
c6d7b512bd
integration: replace time.Sleep with assert.EventuallyWithT (#2680) 2025-07-10 23:38:55 +02:00
Kristoffer Dalby
b904276f2b poll: use nodeview everywhere
There was a bug in HA subnet router handover where we used stale node data
from the longpoll session that we handed to Connect. This meant that we got
some odd behaviour where routes would not be deactivated correctly.

This commit changes to the nodeview is used through out, and we load the
current node to be updated in the write path and then handle it all there
to be consistent.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-07-08 21:05:15 +02:00
Kristoffer Dalby
73023c2ec3 all: use immutable node view in read path
This commit changes most of our (*)types.Node to
types.NodeView, which is a readonly version of the
underlying node ensuring that there is no mutations
happening in the read path.

Based on the migration, there didnt seem to be any, but the
idea here is to prevent it in the future and simplify other
new implementations.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-07-07 21:28:59 +01:00
Kristoffer Dalby
c6736dd6d6 db: add sqlite "source of truth" schema
Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-07-07 15:48:38 +01:00
Stavros Kois
855c48aec2
remove unneeded check (#2658) 2025-07-04 15:47:01 +00:00
Stavros Kois
ded049b905
don't crash if config file is missing (#2656) 2025-07-04 12:58:17 +00:00
eyJhb
efc6974017
fix typo in parseCapabilityVersion, and removed unused error (#2644) (#2644) 2025-07-04 09:40:29 +02:00
Fredrik Ekre
3f72ee9de8
Clarify SIGHUP log message (#2661) 2025-07-04 09:30:51 +02:00
nblock
e73b2a9fb9
Ensure that a username starts with a letter (#2635) 2025-06-24 14:45:44 +02:00
Kristoffer Dalby
1553f0ab53 state: introduce state
this commit moves all of the read and write logic, and all different parts
of headscale that manages some sort of persistent and in memory state into
a separate package.

The goal of this is to clearly define the boundry between parts of the app
which accesses and modifies data, and where it happens. Previously, different
state (routes, policy, db and so on) was used directly, and sometime passed to
functions as pointers.

Now all access has to go through state. In the initial implementation,
most of the same functions exists and have just been moved. In the future
centralising this will allow us to optimise bottle necks with the database
(in memory state) and make the different parts talking to eachother do so
in the same way across headscale components.

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>
2025-06-24 07:58:54 +02:00
Kristoffer Dalby
a975b6a8b1
hscontrol: remove go-grpc-middleware v1 dependency (#2653)
Co-authored-by: Claude <noreply@anthropic.com>
2025-06-23 16:57:20 +02:00
Kristoffer Dalby
afc11e1f0c
cmd/hi: fixes and qol (#2649) 2025-06-23 13:43:14 +02:00