Add generic N8N_GRACEFUL_SHUTDOWN_TIMEOUT which controls how long n8n
process will wait for graceful exit before exitting forcefully. This
variables replaces the QUEUE_WORKER_TIMEOUT variable that was used for
worker process.
DEPRECATED: QUEUE_WORKER_TIMEOUT deprected
QUEUE_WORKER_TIMEOUT environment variable has been replaced with
N8N_GRACEFUL_SHUTDOWN_TIMEOUT.
If the process doesn't shutdown within a time limit, exit with error
code.
1. conceptually something timing out is an error.
2. on successful exit we close down the DB connection gracefully. On an
exit timeout we rather not do that, since it will wait for any active
connections to close and would possible block the exit.
We added ID-less workflow reporting at #8031, which has already produced
multiple reports coming from internal, enough info to tackle [this
story](https://linear.app/n8n/issue/PAY-1147). To prevent an
overwhelming number of reports from cloud, this PR removes the reporting
for now.
We're initializing the queue twice because of a [bad
merge](2c63474538).
No associated known bugs but no need to init the queue twice. We should
follow up by investigating if any pending bugs can be associated to
this.
Ensure all errors in `cli` are `ApplicationError` or children of it and
contain no variables in the message, to continue normalizing all the
errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7839
Ensure all errors in `cli` inherit from `ApplicationError` to continue
normalizing all the errors we report to Sentry
Follow-up to: https://github.com/n8n-io/n8n/pull/7820
This change ensures that things like `encryptionKey` and `instanceId`
are always available directly where they are needed, instead of passing
them around throughout the code.
In a rare edge case an undefined queue could be returned - this should
not happen and now an error is thrown.
Also using the opportunity to remove a cyclic dependency from the Queue.
all commands sent between main instance and workers need to contain a
server id to prevent senders from reacting to their own messages,
causing loops
this PR makes sure all sent messages contain a sender id by default as
part of constructing a sending redis client.
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
Depends on: #7092 | Story:
[PAY-768](https://linear.app/n8n/issue/PAY-768)
This PR:
- Generalizes the `IBinaryDataManager` interface.
- Adjusts `Filesystem.ts` to satisfy the interface.
- Sets up an S3 client stub to be filled in in the next PR.
- Turns `BinaryDataManager` into an injectable service.
- Adjusts the config schema and adds new validators.
Note that the PR looks large but all the main changes are in
`packages/core/src/binaryData`.
Out of scope:
- `BinaryDataManager` (now `BinaryDataService`) and `Filesystem.ts` (now
`fs.client.ts`) were slightly refactored for maintainability, but fully
overhauling them is **not** the focus of this PR, which is meant to
clear the way for the S3 implementation. Future improvements for these
two should include setting up a backwards-compatible dir structure that
makes it easier to locate binary data files to delete, removing
duplication, simplifying cloning methods, using integers for binary data
size instead of `prettyBytes()`, writing tests for existing binary data
logic, etc.
---------
Co-authored-by: कारतोफ्फेलस्क्रिप्ट™ <aditya@netroy.in>
This PR implements the updated license SDK so that worker and webhook
instances do not auto-renew licenses any more.
Instead, they receive a `reloadLicense` command via the Redis client
that will fetch the updated license after it was saved on the main
instance
This also contains some refactoring with moving redis sub and pub
clients into the event bus directly, to prevent cyclic dependency
issues.
# Motivation
In Queue mode, finished executions would cause the main instance to
always pull all execution data from the database, unflatten it and then
use it to send out event log events and telemetry events, as well as
required returns to Respond to Webhook nodes etc.
This could cause OOM errors when the data was large, since it had to be
fully unpacked and transformed on the main instance’s side, using up a
lot of memory (and time).
This PR attempts to limit this behaviour to only happen in those
required cases where the data has to be forwarded to some waiting
webhook, for example.
# Changes
Execution data is only required in cases, where the active execution has
a `postExecutePromise` attached to it. These usually forward the data to
some other endpoint (e.g. a listening webhook connection).
By adding a helper `getPostExecutePromiseCount()`, we can decide that in
cases where there is nothing listening at all, there is no reason to
pull the data on the main instance.
Previously, there would always be postExecutePromises because the
telemetry events were called. Now, these have been moved into the
workers, which have been given the various InternalHooks calls to their
hook function arrays, so they themselves issue these telemetry and event
calls.
This results in all event log messages to now be logged on the worker’s
event log, as well as the worker’s eventbus being the one to send out
the events to destinations. The main event log does…pretty much nothing.
We are not logging executions on the main event log any more, because
this would require all events to be replicated 1:1 from the workers to
the main instance(s) (this IS possible and implemented, see the worker’s
`replicateToRedisEventLogFunction` - but it is not enabled to reduce the
amount of traffic over redis).
Partial events in the main log could confuse the recovery process and
would result in, ironically, the recovery corrupting the execution data
by considering them crashed.
# Refactor
I have also used the opportunity to reduce duplicate code and move some
of the hook functionality into
`packages/cli/src/executionLifecycleHooks/shared/sharedHookFunctions.ts`
in preparation for a future full refactor of the hooks
This PR adds new endpoints to the REST API:
`/orchestration/worker/status` and `/orchestration/worker/id`
Currently these just trigger the return of status / ids from the workers
via the redis back channel, this still needs to be handled and passed
through to the frontend.
It also adds the eventbus to each worker, and triggers a reload of those
eventbus instances when the configuration changes on the main instances.