Deep dive into advanced NSO concepts.
Select the optimal CDB persistence mode for your use case.
The Configuration Database (CDB) is a built-in datastore for NSO, specifically designed for network automation use cases and backed by the YANG schema. Since NSO 6.4, the CDB can be configured to operate in one of two distinct modes: `in-memory-v1` and `on-demand-v1`.
The `in-memory-v1` mode keeps all the configuration data in RAM for the fastest access time. New data is persisted to disk in the form of journal (WAL) files, which the system uses on every restart to reconstruct the RAM database. However, the amount of RAM needed is proportional to the number of managed devices and services, so managing a large network can require a large amount of RAM. This is the only CDB persistence mode available before NSO 6.4.
The `on-demand-v1` mode loads data on demand from the disk into RAM and supports offloading the least-used data to free up memory. Loading only the compiled YANG schema initially (in the form of .fxs files) results in faster system startup times. This mode was first introduced in NSO 6.4.
For reliable storage of the configuration on disk, regardless of the persistence mode, the CDB requires that the file system correctly implements the standard primitives for file synchronization and truncation. For this reason (as well as for performance), NFS or other network file systems are unsuitable for use with the CDB - they may be acceptable for development, but using them in production is unsupported and strongly discouraged.
Compared to `in-memory-v1`, `on-demand-v1` mode has a number of benefits:
Faster startup time: Data is not loaded into memory at startup, only schema is.
Lower memory requirements: Data is loaded into memory only when needed and offloaded when not.
Faster sync of high-availability nodes: Only subscribed data on the followers is loaded at once.
Background compaction: Compaction process no longer locks the CDB, allowing writes to proceed uninterrupted.
While the `on-demand-v1` mode is as fast for reads of "hot" data (already in memory) as the `in-memory-v1` mode, reads are slower for "cold" data (not loaded in memory), since the data first has to be read from disk. In turn, this results in a bigger variance in the time that a read takes in the `on-demand-v1` mode, based on whether the data is already available in RAM or not. The variance can show up in different ways, for example as a longer time to produce the service mapping or to create a rollback for the first request. To lessen the effect, we highly recommend fast storage, such as NVMe flash drives.
Furthermore, the two modes differ in the way they internally organize and store data, resulting in different performance characteristics. If sufficient RAM is available, in some cases `in-memory-v1` performs better, while in others, `on-demand-v1` performs better. One known case where the performance of `on-demand-v1` does not reach that of `in-memory-v1` is deleting large trees of data. But in general, only extensive testing of the specific use case can tell which mode performs better.
As a rule of thumb, we recommend the `on-demand-v1` mode as it has typical performance comparable to `in-memory-v1` but has better maintainability properties. However, if performance requirements and testing favor the `in-memory-v1` mode, that may be a viable choice. Discounting the migration time, you can easily switch between the two modes with automatic migration at system startup.
The CDB persistence is configured under `/ncs-config/cdb/persistence` in the `ncs.conf` file. The `format` leaf selects the desired persistence mode, either `on-demand-v1` or `in-memory-v1` (default is `in-memory-v1`), and the system automatically migrates the data on the next start if needed. Note that the system will not be available for the duration of the migration.
With the `on-demand-v1` mode, additional offloading configuration under the `offload` container becomes relevant (`in-memory-v1` keeps all data in RAM and does not perform any offloading). The `offload/interval` setting specifies how often the system checks its memory consumption and starts the offload process if required.
During the offloading process, data is evicted from memory:
If the piece of data was last accessed more than `offload/threshold/max-age` ago (the default value of infinity disables this check).
The least-recently-used items are evicted until memory usage drops below the allowed amount.
The allowed amount is defined either by the absolute value `offload/threshold/megabytes` or by `offload/threshold/system-memory-percentage`, where the value is calculated dynamically based on the available system RAM. We recommend using the latter unless testing has shown specific requirements.
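The two eviction rules above can be sketched as a small Python model. This is only an illustration of the described policy, not NSO code: the `Item` type, the function names, and the idea of tracking sizes in megabytes per item are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    key: str
    size_mb: float
    last_access: float  # seconds, on the same clock as `now`


def allowed_from_percentage(system_ram_mb, percent):
    """Budget derived from system RAM, like system-memory-percentage."""
    return system_ram_mb * percent / 100.0


def select_evictions(items, now, allowed_mb, max_age=float("inf")):
    """Return the keys to evict, in eviction order (hypothetical model)."""
    evicted, kept = [], []
    # Rule 1: evict anything last accessed more than max_age ago.
    for item in items:
        if now - item.last_access > max_age:
            evicted.append(item.key)
        else:
            kept.append(item)
    # Rule 2: evict least-recently-used items until usage fits the budget.
    kept.sort(key=lambda i: i.last_access)  # oldest access first
    usage = sum(i.size_mb for i in kept)
    for item in kept:
        if usage <= allowed_mb:
            break
        evicted.append(item.key)
        usage -= item.size_mb
    return evicted
```

With the default `max_age` of infinity, only the second rule applies, which mirrors the default described above.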
The actual value should be adjusted according to the use case and system requirements; there is no single optimal setting for all cases. We recommend you start with defaults and then adjust according to observations. You can enable the new `/ncs-config/cdb/persistence/db-statistics` property to aid you in this task; the counters and gauges are available under `/ncs:metric/sysadmin/*/cdb`.
For durability, improved performance, and snapshot isolation, CDB writes in NSO use data structures, such as a write-ahead log (WAL), that require periodic compaction.
For example, the `in-memory-v1` persistence mode appends a new log entry for each CDB transaction to the target datastore WAL file (`A.cdb` for configuration, `O.cdb` for operational, and `S.cdb` for snapshot datastore). Depending on the size and number of transactions towards the system, these files will grow in size, leading to increased disk utilization, longer boot times, and longer initial data synchronization time when setting up a high-availability cluster using this persistence mode.
Compaction is a mechanism used to reduce the size of the write-ahead logs to a minimum. In `on-demand-v1` mode, it is automatic, non-configurable, and runs in the background without affecting the ongoing transactions.
But in `in-memory-v1` mode, it works by replacing an existing write-ahead log, which is composed of a number of consecutive transaction logs created at run time, with a single transaction log representing the full current state of the datastore. From this perspective, a compaction acts similarly to a write transaction towards a datastore. To ensure data integrity, write transactions towards the datastore are not permitted while compaction takes place. For this reason, NSO exposes a number of settings to control the compaction process in `in-memory-v1` mode (these have no effect for `on-demand-v1`).
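To make the idea of compaction concrete, here is a minimal Python sketch of a write-ahead log that is replayed to rebuild state and then compacted into a single entry. It is a conceptual model only: the log layout (lists of `(path, value)` pairs, with `None` marking a delete) is an assumption for the example and says nothing about the actual CDB file format.

```python
def replay(log):
    """Rebuild the in-RAM state by applying every logged transaction in order."""
    state = {}
    for transaction in log:
        for path, value in transaction:
            if value is None:  # None marks a delete in this sketch
                state.pop(path, None)
            else:
                state[path] = value
    return state


def compact(log):
    """Replace the whole log with one transaction holding the current state."""
    state = replay(log)
    return [sorted(state.items())]
```

Replaying the compacted log yields the same state as replaying the original one, but there is only one entry to read, which is why compaction shortens restart time.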
By default, compaction is handled automatically by the CDB. After each transaction, CDB evaluates whether compaction is required for the affected datastore.
This is done by examining the number of added nodes as well as the file size changes since the last performed compaction. The thresholds used can be modified in the `ncs.conf` file by configuring the `/ncs-config/compaction/file-size-relative`, `/ncs-config/compaction/file-size-absolute`, and `/ncs-config/compaction/num-node-relative` settings.
It is also possible to automatically trigger compaction after a set number of transactions by setting the `/ncs-config/compaction/num-transaction` property.
In the configuration datastore, compaction is by default delayed by 5 seconds when the threshold is reached to prevent any upcoming write transaction from being blocked. If the system is idle during these 5 seconds, meaning that there is no new transaction, the compaction will initiate. Otherwise, compaction is delayed by another 5 seconds. The delay time can be configured in `ncs.conf` by setting the `/ncs-config/compaction/delayed-compaction-timeout` property.
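The delayed-compaction behavior can be illustrated with a short Python function that computes when compaction would start for a given sequence of transactions. This is a sketch under one assumed reading of the text: the delay is evaluated in consecutive fixed windows, and any transaction inside a window postpones compaction by one more window. The function name and parameters are hypothetical.

```python
def compaction_start(threshold_time, transaction_times, delay=5.0):
    """Return the time at which delayed compaction starts (toy model).

    threshold_time: when the compaction threshold was reached.
    transaction_times: times of transactions arriving afterwards.
    """
    pending = sorted(t for t in transaction_times if t > threshold_time)
    window_start = threshold_time
    while True:
        window_end = window_start + delay
        busy = any(window_start < t <= window_end for t in pending)
        if not busy:
            return window_end  # system was idle for a full window
        window_start = window_end  # delay by another window
```

For an idle system, compaction starts one delay period after the threshold is reached; each busy window pushes it back by the same amount.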
As compaction may require a significant amount of time, it may be preferable to disable automatic compaction by CDB and instead trigger compaction manually according to specific needs. If doing so, it is highly recommended to have another automated system in place. Automation of compaction can be done by using a scheduling mechanism such as cron, or by using the NSO scheduler. See Scheduler for more information.
By default, CDB may perform compaction during its boot process. This may be disabled if required, by starting NSO with the flag `--disable-compaction-on-start`.
Additionally, CDB CAPI provides a set of functions that may be used to create an external mechanism for compaction. See `cdb_initiate_journal_compaction()`, `cdb_initiate_journal_dbfile_compaction()`, and `cdb_get_compaction_info()` in confd_lib_cdb(3) in Manual Pages.
Learn about different transaction locks in NSO and their interactions.
This section explains the different locks that exist in NSO and how they interact. It is important to understand the architecture of NSO with its management backplane, and the transaction state machine as described in Package Development to be able to understand how the different locks fit into the picture.
The NSO management backplane keeps a lock on the datastore running. This lock is usually referred to as the global lock and it provides a mechanism to grant exclusive access to the datastore.
The global lock is the only lock that can explicitly be taken through a northbound agent, for example by the NETCONF `<lock>` operation, or by calling `Maapi.lock()`.
A global lock can be taken for the whole datastore, or it can be a partial lock (for a subset of the data model). Partial locks are exposed through NETCONF and MAAPI and are only supported for operations toward the running datastore.
An agent can request a global lock to ensure that it has exclusive write access. When a global lock is held by an agent, it is not possible for anyone else to write to the datastore that the lock guards - this is enforced by the transaction engine. A global lock on running is granted to an agent if there are no other holders of it (including partial locks) and if all data providers approve the lock request. Each data provider (CDB and/or external data providers) will have its `lock()` callback invoked to get a chance to refuse or accept the lock. The output of `ncs --status` includes locking status: for each user session, the locks it holds (if any) are listed per datastore.
A northbound agent starts a user session towards NSO's management backplane. Each user session can then start multiple transactions. A transaction is either read/write or read-only.
The transaction engine has its internal locks towards the running datastore. These transaction locks exist to serialize configuration updates towards the datastore and are separate from the global locks.
As a northbound agent wants to update the running datastore with a new configuration, it will implicitly grab and release the transactional lock. The transaction engine takes care of managing the locks, as it moves through the transaction state machine and there is no API that exposes the transactional locks to the northbound agents.
When the transaction engine wants to take a lock for a transaction (for example when entering the validate state), it first checks that no other transaction has the lock. Then it checks that no user session has a global lock on that datastore. Finally, each data provider is invoked by its `transLock()` callback.
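The three checks can be modeled in a few lines of Python. This is a toy model for illustration, not the transaction engine: the class and method names are invented, data providers are represented as plain callables that return `True` to approve, and the sketch assumes that a global lock held by the transaction's own user session does not block it.

```python
class RunningDatastore:
    """Toy model of the checks made before a transaction lock is granted."""

    def __init__(self, data_providers=()):
        self.global_lock_session = None  # user session holding the global lock
        self.trans_lock_holder = None    # transaction holding the transaction lock
        self.data_providers = list(data_providers)

    def take_trans_lock(self, transaction, session):
        # 1. No other transaction may hold the transaction lock.
        if self.trans_lock_holder is not None:
            return False
        # 2. No other user session may hold a global lock (assumption:
        #    the transaction's own session is allowed through).
        if self.global_lock_session not in (None, session):
            return False
        # 3. Every data provider gets a chance to refuse.
        if not all(approve(transaction) for approve in self.data_providers):
            return False
        self.trans_lock_holder = transaction
        return True

    def release_trans_lock(self, transaction):
        if self.trans_lock_holder == transaction:
            self.trans_lock_holder = None
```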
In contrast to the implicit transactional locks, some northbound agents expose explicit access to the global locks. This is done a bit differently by each agent.
The management API exposes the global locks by providing `Maapi.lock()` and `Maapi.unlock()` methods (and the corresponding `Maapi.lockPartial()` and `Maapi.unlockPartial()` for partial locking). Once a user session is established (or attached to), these functions can be called.
In the CLI, the global locks are taken when entering different configure modes as follows:
`config exclusive`: The running datastore global lock will be taken.
`config terminal`: Does not grab any locks.
The global lock is then kept by the CLI until the configure mode is exited.
The Web UI behaves in the same way as the CLI (it presents three edit tabs called Edit private, Edit exclusive, and which correspond to the CLI modes described above).
The NETCONF agent translates the `<lock>` operation into a request for the global lock for the requested datastore. Partial locks are also exposed through the partial-lock RPC.
Implementing the `lock()` and `unlock()` callbacks is not required of an external data provider. NSO will never try to initiate the `transLock()` state transition (see the transaction state diagram in Package Development) towards a data provider while a global lock is taken, so the only reason for a data provider to implement the locking callbacks is if someone else can write to (or lock, for example to take a backup of) the data provider's database.
CDB ignores the `lock()` and `unlock()` callbacks (since the data-provider interface is the only write interface towards it).
CDB has its own internal locks on the database. The running datastore has a single write and multiple read locks. It is not possible to grab the write-lock on a datastore while there are active read-locks on it. The locks in CDB exist to make sure that a reader always gets a consistent view of the data (in particular it becomes very confusing if another user is able to delete configuration nodes in between calls to getNext() on YANG list entries).
During a transaction, `transLock()` takes a CDB read-lock towards the transaction's datastore and `writeStart()` tries to release the read-lock and grab the write-lock instead.
A CDB external reader client implicitly takes a CDB read-lock between `Cdb.startSession()` and `Cdb.endSession()`. This means that while a CDB client is reading, a transaction cannot pass through `writeStart()` (and conversely, a CDB reader cannot start while a transaction is in between `writeStart()` and `commit()` or `abort()`).
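The interaction between readers and the write-lock can be sketched as a simple counter-based model in Python. The class below is hypothetical and only mirrors the rules stated above; it returns `False` where the real system would block or retry.

```python
class CdbLocks:
    """Toy model: many readers or one writer on the running datastore."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def start_read(self):
        """Models transLock() or a reader's session start."""
        if self.writer:
            return False  # a transaction is between writeStart and commit/abort
        self.readers += 1
        return True

    def end_read(self):
        """Models the end of a reader's session."""
        self.readers -= 1

    def write_start(self):
        """Swap the caller's own read-lock for the write-lock.

        The caller must already hold one read-lock (taken in start_read).
        """
        if self.writer or self.readers > 1:
            return False  # another reader is still active
        self.readers -= 1
        self.writer = True
        return True

    def finish_write(self):
        """Models commit() or abort()."""
        self.writer = False
```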
The Operational store in CDB does not have any locks. NSO's transaction engine can only read from it, and the CDB client writes are atomic per write operation.
When a session tries to modify a data store that is locked in some way, it will fail. For example, the CLI might print:
Since some of the locks are short-lived (such as a CDB read-lock), NSO is by default configured to retry the failing operation for a short period of time. If the data store is still locked after this time, the operation fails.
To configure this, set `/ncs-config/commit-retry-timeout` in `ncs.conf`.
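Conceptually, the retry behaves like a bounded retry loop. The following Python sketch shows the general pattern only; the function name, the retry interval, and the injectable `clock` and `sleep` parameters are assumptions for the example and do not reflect how NSO schedules its retries internally.

```python
import time


def retry_until(operation, timeout, interval=0.1,
                clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or `timeout` seconds have passed.

    operation: callable returning True on success, False if still locked.
    Returns True on success and False if the timeout expired.
    """
    deadline = clock() + timeout
    while True:
        if operation():
            return True
        if clock() >= deadline:
            return False  # still locked after the retry period
        sleep(interval)
```

A longer timeout tolerates longer-held locks at the cost of the caller waiting longer before seeing a failure.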
Restart strategy for the service manager.
The service manager executes in a Java VM outside of NSO. The `NcsMux` initializes a number of sockets to NSO at startup. These are Maapi sockets and data provider sockets. NSO can choose to close any of these sockets whenever NSO requests the service manager to perform a task, and that task is not finished within the stipulated timeout. If that happens, the service manager must be restarted. The timeouts are controlled by several `ncs.conf` parameters found under `/ncs-config/japi`.
Store strings in NSO that are encrypted and decrypted using cryptographic keys.
By using the NSO built-in encrypted YANG extension types `tailf:des3-cbc-encrypted-string`, `tailf:aes-cfb-128-encrypted-string`, or `tailf:aes-256-cfb-128-encrypted-string`, it is possible to store encrypted string values in NSO that can be decrypted. See the tailf_yang_extensions(5) man page for more details on the encrypted string YANG extension types.
NSO supports defining one or more sets of cryptographic keys directly in `ncs.conf` and in an external file read by an external command. Three methods can be used to configure the keys in `ncs.conf`:
External keys under `/ncs-config/encrypted-strings/external-keys`.
Key rotation under `/ncs-config/encrypted-strings/key-rotation`.
Legacy (single generation) format: `/ncs-config/encrypted-strings/DES3CBC`, `/ncs-config/encrypted-strings/AESCFB128`, and `/ncs-config/encrypted-strings/AES256CFB128`.
Local installation: Dummy keys are provided in legacy format in `ncs.conf` for development purposes. For deployment, the keys must be changed to random values. Example local installation `ncs.conf` (do not reuse):
System installation: Random keys are generated in the legacy format stored in `${NCS_CONFIG_DIR}/ncs.crypto_keys`, and read using the `${NCS_DIR}/bin/ncs_crypto_keys` external command as configured in `${NCS_CONFIG_DIR}/ncs.conf`. Example system installation `ncs.conf`:
Example system installation `ncs.crypto_keys` file (do not reuse):
For details on using a custom external command to read the encryption keys, see Encrypted Strings.
To provide keys that can be rotated in `ncs.conf`, each generation of cryptographic keys must be encapsulated in a `/ncs-config/encrypted-strings/key-rotation` list, and a `/ncs-config/encrypted-strings/key-rotation/generation` list key starting from `0` must be included and incremented for each set of cryptographic keys. Example (do not reuse):
External keys that can be rotated must be provided with the initial line `EXTERNAL_KEY_FORMAT=2` and the `generation` within square brackets. Example (do not reuse):
There is always an active generation:
Active generation is the generation in the set of keys currently used to encrypt and decrypt all leafs with an encrypted string type.
The active generation is persisted.
If using the legacy method of providing keys in `ncs.conf` or when providing keys using the `/ncs-config/encrypted-strings/external-keys` method without providing the initial line `EXTERNAL_KEY_FORMAT=2` in the application, the active generation will be `-1`.
If starting NSO without any previous keys using the `/ncs-config/encrypted-strings/key-rotation` method or the `external-keys` method with the initial line `EXTERNAL_KEY_FORMAT=2`, the highest provided generation will be selected as the active generation.
For `ncs.conf` details, see the ncs.conf(5) man page under `/ncs-config/encrypted-strings`.
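The rules for the active generation can be summarized as a small decision function. This Python sketch is an interpretation of the rules above, not NSO code: the function and parameter names are hypothetical, and it assumes that a previously persisted active generation takes precedence over the keys provided at startup.

```python
def active_generation(persisted, provided_generations, generations_supported):
    """Pick the active key generation (illustrative model).

    persisted: previously persisted active generation, or None on first start.
    provided_generations: generation numbers found in the provided keys.
    generations_supported: False for keys provided without generations
        (legacy format), True when keys carry generation numbers.
    """
    if persisted is not None:
        return persisted  # assumption: the persisted value wins
    if not generations_supported:
        return -1  # keys without generations map to generation -1
    return max(provided_generations)  # first start: highest generation
```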
Rotating cryptographic keys means replacing an old cryptographic key with a new one while maintaining the functionality of the encryption and decryption of encrypted string values in NSO. It is a standard practice in cryptography and key management to enhance security and mitigate risks associated with key exposure or compromise. Key rotation helps ensure that sensitive data remains secure over time. It reduces the impact of potential key compromise and adheres to best practices for cryptographic hygiene. Key benefits:
If a cryptographic key is compromised, rotating it reduces the amount of data exposed to the attacker since previously encrypted values can be re-encrypted with a new key.
Regular rotation minimizes the time a single key is in use, thereby reducing the potential damage an attacker could do if they gain access to it.
Reusing the same key for a prolonged period increases the risk of data correlation attacks (e.g., frequency analysis). Rotation ensures unique keys are used for encrypting strings, reducing this risk.
Regularly rotating keys helps organizations maintain and test their key management processes. This ensures the system is prepared to handle key management tasks effectively in an emergency.
To rotate to a new generation of keys and re-encrypt the data:
Always take a backup using ncs-backup.
Check the currently active generation using the `/key-rotation/get-active-generation` action.
Re-encrypt all encrypted values with a new set of keys using the `/key-rotation/apply-new-key` action with the `new-key-generation` to rotate to as input.
The commit queue must be empty before running the action, or the action will fail, as the snapshot database is re-initialized. To wait for the commit queue to become empty, use the `wait-commit-queue` argument with the number of seconds to wait before failing.
CLI example:
The data in CDB that is subject to re-encryption when executing the `/key-rotation/apply-new-key` action:
Encrypted types.
Unions of encrypted types.
Service metadata (original attribute, reverse and forward diff set).
NED secrets.
Rollback files.
History log.
Under the hood, the `/key-rotation/apply-new-keys` action, when executed, performs the following steps:
Starts an upgrade transaction that will be used when re-encrypting the datastore.
Loads the new active cryptographic keys into CDB and persists them.
Syncs HA.
Re-encrypts data.
Drops the CDB snapshot database.
Commits data.
Restarts NSO VMs.
Ends the upgrade.
Before changing the cryptographic keys, always take a backup using ncs-backup. Also, back up the external key file (by default `${NCS_CONFIG_DIR}/ncs.crypto_keys`) or the `${NCS_CONFIG_DIR}/ncs.conf` file, depending on where the keys are stored.
Suppose you have previously provided keys in the legacy format and wish to switch to `/ncs-config/encrypted-strings/key-rotation` or `external-keys` with the initial line `EXTERNAL_KEY_FORMAT=2`. In that case, you must provide the currently used keys as generation `-1`. The new keys can have any non-negative generation number.
Replace the external key file or `ncs.conf` file depending on where the keys are stored.
Issue `ncs --reload` to reload the cryptographic keys.
Ensure commit queues are empty or wait for them to become empty.
Execute the `/key-rotation/apply-new-keys` action to change the active generation, for example, from `-1` to `new-key-generation 0` as shown in the CLI example above.
In a high-availability setting, keys must be identical on all nodes before attempting key rotation. Otherwise, the action will abort. The node executing the action will initiate the key reload for all nodes.
Learn about using IPv6 on NSO's northbound interfaces.
NSO supports access to all northbound interfaces via IPv6, and in the simplest case, i.e. IPv6-only access, this is just a matter of configuring an IPv6 address (typically the wildcard address `::`) instead of IPv4 for the respective agents and transports in `ncs.conf`, e.g. `/ncs-config/cli/ssh/ip` for SSH connections to the CLI, or `/ncs-config/netconf-north-bound/transport/ssh/ip` for SSH to the NETCONF agent. The SNMP agent is configured via one of the other northbound interfaces rather than via `ncs.conf`, see NSO SNMP Agent in Northbound APIs. For example, via the CLI, we would set `snmp agent ip` to the desired address. All these addresses default to the IPv4 wildcard address `0.0.0.0`.
In most IPv6 deployments, it will however be necessary to support IPv6 and IPv4 access simultaneously. This requires that both IPv4 and IPv6 addresses are configured, typically `0.0.0.0` plus `::`. To support this, there is, in addition to the `ip` and `port` leafs, also a list `extra-listen` for each agent and transport, where additional IP address and port pairs can be configured. Thus, to configure the CLI to accept SSH connections to port 2024 on any local IPv6 address, in addition to the default (port 2024 on any local IPv4 address), we can add an `<extra-listen>` section under `/ncs-config/cli/ssh` in `ncs.conf`:
To configure the SNMP agent to accept requests to port 161 on any local IPv6 address, we could similarly use the CLI and give the command:
The `extra-listen` list can take any number of address/port pairs, thus this method can also be used when we want to accept connections/requests on several specified (IPv4 and/or IPv6) addresses instead of the wildcard address, or we want to use multiple ports.
Run NSO as non-root user.
A common misfeature found on UNIX operating systems is the restriction that only `root` can bind to ports below 1024. Many a dollar has been wasted on workarounds and often the results are security holes.
Both FreeBSD and Solaris have elegant configuration options to turn this feature off. On FreeBSD:
The above is best added to your `/etc/sysctl.conf`.
Similarly, on Solaris, we can just configure this. Assuming we want to run NSO under a non-root user `ncs`, we can grant the specific right to bind privileged ports below 1024 (and only that) to the `ncs` user using:
And check that we get what we want through:
Linux doesn't have anything like the above. There are a couple of options on Linux. The best is to use an auxiliary program like `authbind` (http://packages.debian.org/stable/authbind) or `privbind` (http://sourceforge.net/projects/privbind/).
These programs are run by `root`. To start NSO under e.g. `privbind`, we can do:
The above command starts NSO as the user `ncs` and binds to ports below 1024.
Connect client libraries to NSO with IPC.
Client libraries connect to NSO for inter-process communication (IPC) using TCP or Unix domain sockets.
If NSO is configured to use TCP sockets for IPC, you can tell NSO which address to use for these connections through the `/ncs-config/ncs-ipc-address/ip` (default value 127.0.0.1) and `/ncs-config/ncs-ipc-address/port` (default value 4569) elements in `ncs.conf`. If you change these values, you will likely need to configure the clients accordingly. Note that these values have security implications, see Security Issues. In particular, changing the address away from 127.0.0.1 may allow unauthenticated remote connections.
Many of the clients read the environment variables `NCS_IPC_ADDR` and `NCS_IPC_PORT` to determine if something other than the default is to be used, but others might need source code changes. This is a list of clients that communicate with NSO, and what needs to be done when `ncs-ipc-address` is changed.
Remote commands via the `ncs` command
Remote commands, such as `ncs --reload`, check the environment variables `NCS_IPC_ADDR` and `NCS_IPC_PORT`.
CLI tools
The Command Line Interface (CLI) client ncs_cli and similar commands, such as ncs_cmd and ncs_load, check the environment variables `NCS_IPC_ADDR` and `NCS_IPC_PORT`. Alternatively, many of them also support command line options.
CDB and MAAPI clients
The address supplied to `Cdb.connect()` and `Maapi.connect()` must be changed.
Data provider API clients
The address supplied to the socket for the `Dp` constructor must be changed.
Notification API clients
The new address must be supplied to the socket for the `Notif` constructor.
Likewise, if NSO is configured to use Unix domain sockets for IPC and you have changed the path under `/ncs-config/ncs-local-ipc/path` in `ncs.conf`, you can tell clients to use the new path through the `NCS_IPC_PATH` environment variable. Clients must also have filesystem permission to access the IPC path or they will not be able to communicate with the NSO daemon process.
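A custom client or wrapper script can resolve the IPC endpoint from the same environment variables. The Python sketch below is a hypothetical helper, not part of any NSO library; in particular, giving `NCS_IPC_PATH` precedence over the TCP variables is an assumption made for the example.

```python
import os

DEFAULT_ADDR = "127.0.0.1"
DEFAULT_PORT = 4569


def ipc_endpoint(env=None):
    """Resolve where to connect, based on the NCS_IPC_* variables.

    Returns ("unix", path) or ("tcp", address, port).
    """
    env = os.environ if env is None else env
    path = env.get("NCS_IPC_PATH")
    if path:  # assumption: a Unix socket path wins over TCP settings
        return ("unix", path)
    addr = env.get("NCS_IPC_ADDR", DEFAULT_ADDR)
    port = int(env.get("NCS_IPC_PORT", DEFAULT_PORT))
    return ("tcp", addr, port)
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real process environment.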
To run more than one instance of NSO on the same host (which can be useful in development scenarios), each instance needs its own IPC socket. If using TCP for IPC, set `/ncs-config/ncs-ipc-address/port` in `ncs.conf` to different values for each instance. If, instead, you are using Unix sockets for IPC, set `/ncs-config/ncs-local-ipc/path` in `ncs.conf` to different values. In either case, you may also need to change the NETCONF and CLI over SSH ports under `/ncs-config/netconf-north-bound/transport` and `/ncs-config/cli/ssh` by either disabling them or changing their values.
By default, clients connecting to the IPC socket are considered trusted, i.e. there is no authentication required, as the system relies on the use of 127.0.0.1 for `/ncs-config/ncs-ipc-address/ip` or Unix domain sockets to prevent remote access. In case this is not sufficient, such as when untrusted users have shell access on the system where NSO runs, it is possible to further restrict the access to the IPC socket.
If Unix domain sockets are used, you can leverage Unix filesystem permissions for the socket path, to limit which OS users and groups can initiate connections to the socket. NSO may also perform additional authentication of the connecting users, see Authenticating IPC Access.
For TCP sockets, you can enable an access check by setting the `ncs.conf` element `/ncs-config/ncs-ipc-access-check/enabled` to `true`, and specifying a filename for `/ncs-config/ncs-ipc-access-check/filename`. The file should contain a shared secret, i.e., a random (printable ASCII) character string. Clients connecting to the IPC socket will then be required to prove that they have knowledge of the secret through a challenge handshake, before they are allowed access to the NSO functions provided via the IPC socket.
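To illustrate what a challenge handshake achieves, here is a generic HMAC-based challenge-response sketch in Python. It is not the protocol NSO uses on the IPC socket (the text does not specify it); the function names and the choice of HMAC-SHA256 are assumptions for the example. The point is that the client proves knowledge of the secret without ever sending it.

```python
import hashlib
import hmac
import secrets


def new_challenge():
    """Server side: create a fresh random challenge for each connection."""
    return secrets.token_bytes(16)


def respond(secret, challenge):
    """Client side: prove knowledge of the secret without sending it."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret, challenge, response):
    """Server side: recompute the response and compare in constant time."""
    expected = respond(secret, challenge)
    return hmac.compare_digest(expected, response)
```

Because the challenge is random per connection, a captured response cannot simply be replayed on a later connection.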
The access permissions on this file must be restricted via OS file permissions, such that it can only be read by the NSO daemon and client processes that are allowed to connect to the IPC port. E.g. if both the daemon and the clients run as root, the file can be owned by root and have only "read by owner" permission (i.e. mode 0400). Another possibility is to have a group that only the daemon and the clients belong to, set the group ID of the file to that group, and have only "read by group" permission (i.e. mode 0040).
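One way to create such a file is sketched below in Python. The helper name and the secret length are arbitrary choices for the example; the only requirements taken from the text are a random printable ASCII string and restrictive permissions such as mode 0400.

```python
import os
import secrets


def write_secret_file(path, mode=0o400):
    """Create `path` with a random printable secret and restrictive mode.

    Fails if the file already exists, to avoid overwriting a secret in use.
    """
    secret = secrets.token_urlsafe(32)  # URL-safe base64: printable ASCII
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    os.chmod(path, mode)  # make the mode exact regardless of umask
    return secret
```

For the group-based setup, the same helper could be called with `mode=0o040` after which the file's group would be changed with the usual OS tools.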
To provide the secret to the client libraries and inform them that they need to use the access check handshake, you have to set the environment variable `NCS_IPC_ACCESS_FILE` to the full pathname of the file containing the secret. This is sufficient for all the clients mentioned above, i.e., there is no need to change the application code to support or enable this check.
The access check must be either enabled or disabled for both the daemon and the clients. E.g., if `/ncs-config/ncs-ipc-access-check/enabled` in `ncs.conf` is not set to `true` but clients are started with the environment variable `NCS_IPC_ACCESS_FILE` pointing to a file with a secret, the client connections will fail.
Handle tasks that require root privileges.
NSO requires some privileges to perform certain tasks. The following tasks may, depending on the target system, require root privileges.
Binding to privileged ports. The `ncs.conf` configuration file specifies which port numbers NSO should `bind(2)` to. If any of these port numbers are lower than 1024, NSO usually requires root privileges unless the target operating system allows NSO to bind to these ports as a non-root user.
If PAM is to be used for authentication, the program installed as `$NCS_DIR/lib/ncs/priv/pam/epam` acts as a PAM client. Depending on the local PAM configuration, this program may require root privileges. If PAM is configured to read the local `passwd` file, the program must either run as `root` or be `setuid` root. If the local PAM configuration instructs NSO to run, for example, `pam_radius_auth`, root privileges are possibly not required depending on the local PAM installation.
If the CLI is used and we want to create CLI commands that run executables, we may want to modify the permissions of the `$NCS_DIR/lib/ncs/lib/core/confd/priv/cmdptywrapper` program.
To be able to run an executable as root or a specific user, we need to make `cmdptywrapper` `setuid` `root`, i.e.:
# chown root cmdptywrapper
# chmod u+s cmdptywrapper
Failing that, all programs will be executed as the user running the `ncs` daemon. Consequently, if that user is `root`, we do not have to perform the `chmod` operations above. The same applies to executables run via actions, but then we may want to modify the permissions of the `$NCS_DIR/lib/ncs/lib/core/confd/priv/cmdwrapper` program instead:
# chown root cmdwrapper
# chmod u+s cmdwrapper
NSO can be instructed to terminate NETCONF over cleartext TCP. This is useful for debugging since the NETCONF traffic can then be easily captured and analyzed. It is also useful if we want to provide some local proprietary transport mechanism that is not SSH. Clear text TCP termination is not authenticated, the clear text client simply tells NSO which user the session should run as. The idea is that authentication is already done by some external entity, such as an SSH server. If clear text TCP is enabled, NSO must bind to localhost (127.0.0.1) for these connections.
Client libraries connect to NSO. For example, the CDB API is socket based and a CDB client connects to NSO. We instruct NSO which address to use for these connections through the `ncs.conf` parameters `/ncs-config/ncs-ipc-address/ip` (default address 127.0.0.1) and `/ncs-config/ncs-ipc-address/port` (default port 4569), or which Unix socket path to use with `/ncs-config/ncs-local-ipc/path` (default `/tmp/nso/nso-ipc`).
NSO multiplexes different kinds of connections on the same IPC socket. The following programs connect on the socket:
Remote commands, such as ncs --reload
CDB clients
External database API clients
MAAPI, the Management Agent API clients
The ncs_cli program
Since the IPC socket allows full control of the system, it is important to ensure that only trusted or authorized clients can connect. See Restricting Access to the IPC Socket.
Design large and scalable NSO applications using LSA.
Layered Service Architecture (LSA) is a design approach for massively large and scalable NSO applications. Large service providers and enterprises can use it to manage services for millions of users, ranging over several hundred thousand managed devices. Such scale requires special consideration since a single NSO instance no longer suffices and LSA helps you address this challenge.
At some point, scaling up hits the law of diminishing returns. Effectively, adding more resources to the NSO server becomes prohibitively expensive. To further increase the throughput of the whole system, you can share the load across multiple instances, in a scale-out fashion.
You achieve this by splitting a service into a main, upper-layer part, and one or more lower-layer parts. The upper part controls and dispatches work to the lower parts. This is the same approach as using a customer-facing service (CFS) and a resource-facing service (RFS). However, here the CFS code (the upper-layer part) runs in a different NSO node than the RFS code (the lower-layer parts). What is more, the lower-layer parts can be spread across multiple NSO nodes.
Each RFS node is responsible for its own set of managed devices, mounted under its /devices tree, and the upper-layer, CFS node only concerns itself with the RFS nodes. So, the CFS node only mounts the RFS nodes under its /devices tree, not managed devices directly. The main advantage of this architecture is that you can add many device RFS nodes that collectively manage a huge number of actual devices, much more than a single node could.
While it is tempting to design the system in the most scalable way from the start, it comes with a cost. Compared to a single, non-LSA setup, the automation system now becomes distributed across multiple nodes, with all the complexity that entails. For example, in a non-distributed system, the communication between different parts has mostly negligible latency and hardly ever fails. That is certainly not true anymore for distributed systems as we know them today, including LSA.
More practically, taking a service in NSO and deploying a single instance on an LSA system is likely to take longer and have a higher chance of failure compared to a non-LSA system, because additional network communication is involved.
Moreover, multiple NSO nodes present a higher operational complexity and administrative burden. There is no longer a “single pane of glass” view of all the individual devices. That's why you must weigh the benefits of the LSA approach against the scale at which you operate. When LSA starts making sense will depend on the type of devices you manage, the services you have, the geographical distribution of resources, and so on.
A distributed system can push the overall throughput way beyond what a single instance can do. But you will achieve a much better outcome by first focusing on eliminating the bottlenecks in the provisioning code, as discussed in Scaling and Performance Optimization. Only when that proves insufficient, consider deploying LSA.
LSA also addresses the memory limitations of NSO when device configurations become very large (individually or all together). If the NSO server is memory-constrained and more memory cannot be added, the LSA approach can be a solution.
Another challenge that LSA may help you overcome is scaling organizationally. When many teams share the same NSO instance, it can get hard to separate the different concerns and responsibilities. Teams may also have different cadences or preferences for upgrades, resulting in friction. With LSA, it becomes possible to create a clearer separation. The CFS node and the RFS nodes can have different release cycles (as long as the YANG upgrade rules are followed) and each can be upgraded independently. If a bug is found or a feature is missing in the RFS nodes, it can be fixed without affecting the CFS node, and vice versa.
To summarize, the major advantage of this architecture is scalability. The solution scales horizontally, both at the upper and the lower layer, thus catering for truly massive deployments, but at the expense of increased complexity.
To take advantage of the scalability potential of LSA, your services must be designed in a layered fashion. Once the automation logic in NSO reaches a certain level of complexity, a stacked service design tends to emerge naturally. Often, you can extend it to LSA with relatively little change. The same is true for brand-new, green field designs.
In other situations, you might need to invest some additional effort to split and orchestrate the work across multiple groups of devices. Examples are existing monolithic services or stacked service designs that require all RFSs to access all devices.
If you are designing the service from scratch, you have the most freedom in choosing the partitioning of logic between CFS and RFS. The CFS must contain the YANG definition for the service and its configurable options that are available to the customer, perhaps through an order capture system north of the NSO. On the other hand, the RFS YANG models are internal to the service, that is, they are not used directly by the customer. So, you are free to design them in a way that makes the provisioning code as simple as possible.
As an example, you might have a VLAN provisioning service where the CFS lets users select if the hosts on the VLAN can access the internet. Then you can divide provisioning into, let's say, an RFS service that configures the VLAN and the appropriate IP subnet across the data center switches, and another RFS service that configures the firewall to allow the traffic from the subnet to reach the internet. This design clearly separates the provisioned devices into two groups: firewalls and data center switches. Each group can be managed by a separate lower-layer NSO.
Similar to a brand new design, an existing monolithic application that uses stacked services has already laid the groundwork for LSA-compatible design because of the existing division into two layers (upper and lower).
A possible complication, in this case, is when each existing RFS touches all of the affected devices, and that makes it hard to partition devices across multiple lower-layer NSO nodes. For example, if one RFS manages the VLAN interface (the VLAN ID and layer 2 settings) and another RFS manages the IP configuration for this interface, that configuration very likely happens on the same devices. The solution in this situation could be to partition RFS services based on the data center that they operate in, such as one lower-layer NSO node for one data center, another lower-layer NSO for another data center, and so on. If that is not possible, an alternative is to redesign each RFS and split their responsibilities differently.
The most complex, yet common, case is when a single-node NSO installation grows over time and you are faced with performance problems due to the new size. To leverage the LSA functionality, you must first split the service into upper- and lower-layer parts, which requires a certain amount of effort. That is why the decision to use LSA should always be accompanied by a thorough analysis to determine what makes the system too slow. Sometimes, it is a result of a bad "must" expression in the service YANG code or similar. Fixing that is much easier than re-architecting the application.
Regardless of whether you start with a green field design or extend an existing application, you must tackle the problem of dispatching the RFS instantiation to the correct lower-layer NSO node.
Imagine a VPN application that uses a managed device on each site to securely connect to the private network. In a service provider network, this is usually done by the CPE. When a customer orders connectivity to an additional site (another leg of the VPN), the service needs to configure the site-local device (the CPE). As there will be potentially many such devices, each will be managed by one of the RFS nodes. However, the VPN service is managed centrally, through the CFS, which must:
Figure out which RFS node is responsible for the device for the new site (CPE).
Dispatch the RFS instantiation to that particular RFS node, making sure the device is properly configured.
NSO provides a mechanism to facilitate the second part, the actual dispatch, but the service logic must somehow select the correct RFS node. If the RFS nodes are geographically separated across different countries or different data centers, the CFS could simply infer or calculate the right RFS node based on service instance parameters, such as the physical location of the new site.
A more flexible alternative is to use dynamic mapping. It can be as simple as a list of 2-tuples that map a device name to an RFS node, stored in the CDB. The trade-off is that the list must be maintained. It is straightforward to automate the maintenance of the list though, for example through NETCONF notifications whenever /devices/device on the RFS nodes is manipulated or by explicitly asking the CFS node to query the RFS nodes for their list of devices.
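The dynamic mapping described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the class DispatchTable and its method names are hypothetical, and in a real deployment the table would live in the CDB and be fed by whatever notification or query mechanism you choose.

```python
class DispatchTable:
    """Maps a managed device name to the RFS node that owns it."""

    def __init__(self):
        self._device_to_node = {}

    def device_added(self, rfs_node, device):
        # Called when an RFS node reports a new entry in /devices/device.
        self._device_to_node[device] = rfs_node

    def device_removed(self, rfs_node, device):
        # Forget the device only if this node still owns it, so a late
        # removal event cannot undo a newer mapping to another node.
        if self._device_to_node.get(device) == rfs_node:
            del self._device_to_node[device]

    def resync(self, rfs_node, devices):
        # Replace everything known about rfs_node with a fresh device list,
        # as obtained by explicitly querying that node.
        stale = [d for d, n in self._device_to_node.items() if n == rfs_node]
        for device in stale:
            del self._device_to_node[device]
        for device in devices:
            self._device_to_node[device] = rfs_node

    def lookup(self, device):
        try:
            return self._device_to_node[device]
        except KeyError:
            raise LookupError(f"no RFS node known for device {device!r}") from None


table = DispatchTable()
table.resync("rfs-east", ["cpe-1", "cpe-2"])   # hypothetical node and device names
table.device_added("rfs-west", "cpe-3")
print(table.lookup("cpe-3"))
```

Raising an error for an unknown device, rather than guessing a node, keeps a stale table from silently dispatching an RFS instance to the wrong place.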
Ultimately, the right approach to dispatch will depend on the complexity of your service and operational procedures.
Having designed a layered service with the CFS and RFS parts, the CFS must now communicate with the RFS that resides on a different node. You achieve that by adding the lower-layer (RFS) node as a managed device to the upper-layer (CFS) node. The CFS node must access the RFS data model on the lower-layer node, just like it accesses any other configuration on any managed device. But don't you need a NED to do this? Indeed, you do. That's why the RFS model needs to be specially compiled for the upper-layer node to use as part of a NED and not as a standalone service. A model compiled in this way is called 'device compiled'.
Let's then see how the LSA setup affects the whole service provisioning process. Suppose a new request arrives at the CFS node, such as a new service instance being created through RESTCONF by a customer order portal. The CFS runs the service mapping logic as usual; however, instead of configuring the network devices directly, the CFS configures the appropriate RFS nodes with the generated RFS service instance data. This is the dispatch logic in action.
As the configuration for the lower-layer nodes happens under the /devices/device tree, it is picked up and pushed to the relevant NSO instances by the NED. The NED sends the appropriate NETCONF edit-config RPCs, which trigger the RFS FASTMAP code at the RFS nodes. The RFS mapping logic constructs the necessary network configuration for each RFS instance and the RFS nodes update the actual network devices.
In case the commit queue feature is not being used, this entire sequence is serialized through the system as a whole. It means that if another northbound request arrives at the CFS node while the first request is being processed, the second request is synchronously queued at the CFS node, waiting for the currently running transaction to either succeed or fail.
If the code on the RFS nodes is reactive, it will likely return without much waiting, since reactive FASTMAP (RFM) applications are usually very fast during their first round of execution. But performance will still be lower than with the commit queue, since execution is eventually serialized when modifying devices. To maximize throughput, you also need to enable the commit queue functionality throughout the system.
The main benefit of LSA is that it scales horizontally at the RFS node layer. If one RFS node starts to become overloaded, it's easy to bring up an additional one, to share the load. Thus LSA caters to scalability at the level of the number of managed devices. However, each RFS node needs to host all the RFSs that touch the devices it manages under its /devices/device tree. There is still one, and only one, NSO node that directly manages a single device.
Dividing a provisioning application into upper and lower-layer services also increases the complexity of the application itself. For example, to follow the execution of a reactive or nano RFS, typically an additional NETCONF notification code must be written. The notifications have to be sent from the RFS nodes and received and processed by the CFS code. This way, if something goes wrong at the device layer, the information is relayed all the way to the top level of the system.
Furthermore, it is highly recommended that LSA applications enable the commit queue on all NSO nodes. If the commit queue is not enabled, the slowest device on the network will limit the overall throughput, significantly reducing the benefits of LSA.
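A toy model can make the effect of the slowest device concrete. The Python sketch below is deliberately simplified and is not a description of NSO internals: it assumes each request is just a set of per-device durations, ignores processing time on the NSO nodes, and treats the commit queue as independent per-device backlogs. The device names and timings are invented.

```python
def serialized_completion(requests):
    """Requests run one at a time; each lasts as long as its slowest device.

    Returns the time at which each request completes.
    """
    done, clock = [], 0
    for req in requests:
        clock += max(req.values())
        done.append(clock)
    return done


def queued_completion(requests):
    """Idealized queue: each device works through its own backlog.

    A request completes when the last of its devices has finished.
    """
    free_at, done = {}, []
    for req in requests:
        finish = 0
        for device, seconds in req.items():
            free_at[device] = free_at.get(device, 0) + seconds
            finish = max(finish, free_at[device])
        done.append(finish)
    return done


# Three requests arriving back to back; only the first touches a slow device.
requests = [{"fast-1": 1, "slow-1": 10}, {"fast-2": 1}, {"fast-3": 1}]
print(serialized_completion(requests))  # [10, 11, 12]
print(queued_completion(requests))      # [10, 1, 1]
```

In the serialized case, the second and third requests wait behind the slow device even though they never touch it. In the idealized queued case, they complete independently.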
Finally, if the two-layer approach proves to be insufficient due to requirements at the CFS node, you can extend it to three layers, with an additional layer of NSO nodes between the CFS and RFS layers.
This section describes a small LSA application, which exists as a running example in the examples.ncs/layered-services-architecture/lsa-single-version-deployment directory.
The application is a slight variation on the examples.ncs/service-management/rfs-service example where the YANG code has been split up into an upper-layer and a lower-layer implementation. The example topology (based on netsim for the managed devices, and NSO for the upper/lower layer NSO instances) looks like the following:
The upper layer of the YANG service data for this example looks like the following:
Instantiating one CFS we have:
The provisioning code for this CFS has to decide where to instantiate what. In this example, the "what" is trivial: it is the accompanying RFS. The "where" is more involved. The two underlying RFS nodes each manage three netsim routers, so given the input, the CFS code must be able to determine which RFS node to choose. In this example, we have chosen to have an explicit map, so on the upper-nso we also have:
So, we have a template CFS code that does the dispatching to the right RFS node.
This technique for dispatching is simple and easy to understand. The dispatching might be more complex; it might even be determined at execution time, depending on CPU load. It might be inferred from input parameters (as in this example) or it might be computed.
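As a rough illustration of map-based dispatching, the Python sketch below groups the routers of one service instance by the RFS node that manages them. It is not the example's template code. The assignment of ex0 to ex2 to lower-nso-1 and ex3 to ex5 to lower-nso-2 is an assumption made for this sketch, and the function name and the "vid" and "routers" fields are hypothetical.

```python
from collections import defaultdict

# Assumed explicit map from managed router to the RFS node that owns it.
DEVICE_TO_RFS = {
    "ex0": "lower-nso-1", "ex1": "lower-nso-1", "ex2": "lower-nso-1",
    "ex3": "lower-nso-2", "ex4": "lower-nso-2", "ex5": "lower-nso-2",
}


def dispatch(name, vid, routers):
    """Return the RFS instance data to create on each RFS node."""
    per_node = defaultdict(list)
    for router in routers:
        per_node[DEVICE_TO_RFS[router]].append(router)
    return {
        node: {"name": name, "vid": vid, "routers": devices}
        for node, devices in per_node.items()
    }


print(dispatch("v1", 77, ["ex0", "ex5"]))
```

With routers on two different RFS nodes, the result has one entry per node, so the service instance data is sent to both.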
The result of the template-based service is to instantiate the RFS, at the RFS nodes.
First, let's have a look at what happened in the upper-nso. Look at the modifications but ignore the fact that this is an LSA service:
Just the dispatched data is shown. As ex0 and ex5 reside on different nodes, the service instance data has to be sent to both lower-nso-1 and lower-nso-2.
Now let's see what happened in the lower-nso. Look at the modifications and take into account that these are LSA nodes (this is the default):
Both the dispatched data and the modification of the remote service are shown. As ex0 and ex5 reside on different nodes, the service modifications of the service rfs-vlan on both lower-nso-1 and lower-nso-2 are shown.
The communication between the NSO nodes is of course NETCONF.
The YANG model at the lower layer, also known as the RFS layer, is similar to the CFS, but slightly different:
The task for the RFS provisioning code here is to actually provision the designated router. If we log into one of the lower layer NSO nodes, we can check the following.
To conclude this section: the trick to designing a good LSA application is to identify a good layering for the service data models. The upper layer, the CFS layer, is what is exposed northbound. It requires a model that is as forward-looking as possible, since that model is what a system north of NSO integrates with. The lower-layer models, the RFS models, can be viewed as "internal system models" and can be changed more easily.
In this section, we'll describe a lightly modified version of the example in the previous section. The application we describe here exists as a running example under examples.ncs/layered-services-architecture/lsa-scaling.
Sometimes it is desirable to be able to easily move devices from one lower LSA node to another. This makes it possible to easily expand or shrink the number of lower LSA nodes. Additionally, it is sometimes desirable to avoid HA pairs for replication but instead use a common store for all lower LSA devices, such as a distributed database, or a common file system.
The above is possible provided that the LSA application is structured in certain ways.
The lower LSA nodes only expose services that manipulate the configuration of a single device. We call these device RFSs, or dRFS for short.
All services are located in a way that makes it easy to extract them, for example in /drfs:dRFS/device
No RFS takes place on the lower LSA nodes. This avoids the complication with locking and distributed event handling.
The LSA nodes need to be set up with the proper NEDs and with auth groups such that a device can be moved without having to install new NEDs or update auth groups.
Provided that the above requirements are met, it is possible to move a device from one lower LSA node to another by extracting the configuration from the source node and installing it on the target node. This, of course, requires that the source node is still alive, which is normally the case when HA-pairs are used.
An alternative to using HA-pairs for the lower LSA nodes is to extract the device configuration after each modification to the device and store it in some central storage. This would not be recommended when high throughput is required but may make sense in certain cases.
In the example application, there are two packages on the lower LSA nodes that provide this functionality. The package inventory-updater installs a database subscriber that is invoked every time any device configuration is modified, both in the preparation phase and in the commit phase of any such transaction. It extracts the device and dRFS configuration, including service metadata, during the preparation phase. If the transaction proceeds to a full commit, the package is again invoked and the extracted configuration is stored in a file in the directory db_store.
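The two-phase pattern described above (extract during preparation, persist only on commit) can be sketched independently of the NSO subscriber API. The Python class below is a hypothetical illustration, not the inventory-updater code: the class name, the method names, and the choice of one JSON file per device are all assumptions.

```python
import json
import os
import tempfile


class InventoryStore:
    """Holds extracted configuration until the transaction commits."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = {}  # transaction id -> {device name: config}

    def prepare(self, txn_id, device, config):
        # Preparation phase: remember the extracted config, write nothing yet.
        self._pending.setdefault(txn_id, {})[device] = config

    def abort(self, txn_id):
        # The transaction failed, so the extracted data is discarded.
        self._pending.pop(txn_id, None)

    def commit(self, txn_id):
        # Commit phase: persist each device's config, replacing any old file
        # atomically so a reader never sees a half-written file.
        os.makedirs(self.directory, exist_ok=True)
        for device, config in self._pending.pop(txn_id, {}).items():
            fd, tmp_path = tempfile.mkstemp(dir=self.directory)
            with os.fdopen(fd, "w") as handle:
                json.dump(config, handle)
            os.replace(tmp_path, os.path.join(self.directory, device + ".json"))
```

Keeping the extracted data in memory until commit means an aborted transaction leaves the store untouched.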
The other package is called device-actions. It provides three actions: extract-device, install-device, and delete-device. They are intended to be used by the upper LSA node when moving a device either from a lower LSA node or from db_store.
In the upper LSA node, there is one package for coordinating the movement, called move-device. It provides an action for moving a device from one lower LSA node to another. For example, when invoked to move device ex0 from lower-1 to lower-2 using the action
it goes through the following steps:
A partial lock is acquired on the upper-nso for the path /devices/device[name=lower-1]/config/dRFS/device[name=ex0] to avoid any changes to the device while the device is in the process of being moved.
The device and dRFS configuration are extracted in one of two ways:
Read the configuration from lower-1 using the action
Read the configuration from some central store, in our case the file system in the directory db_store.
The configuration will look something like this
Install the configuration on the lower-2 node. This can be done by running the action:
This will load the configuration and commit using the flags no-deploy and no-networking.
Delete the device from lower-1 by running the action
Update the mapping table.
Release the partial lock for /devices/device[name=lower-1]/config/dRFS/device[name=ex0].
Re-deploy all services that have touched the device. The services all have backpointers from /devices/device{lower-1}/config/dRFS/device{ex0}. They are re-deployed using the flags no-lsa and no-networking.
Finally, the action runs compare-config on lower-1 and lower-2.
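The order of the steps above can be summarized in a small, self-contained Python sketch. It uses in-memory stand-ins instead of real NSO nodes and actions, so every name in it (LowerNode, partial_lock, move_device, the redeploy callback) is hypothetical. The final compare-config step is left out because it has no meaningful in-memory equivalent.

```python
from contextlib import contextmanager


class LowerNode:
    """In-memory stand-in for a lower LSA node and its device actions."""

    def __init__(self, name):
        self.name = name
        self.devices = {}  # device name -> device and dRFS configuration

    def extract_device(self, device):
        return dict(self.devices[device])

    def install_device(self, device, config):
        self.devices[device] = dict(config)

    def delete_device(self, device):
        del self.devices[device]


@contextmanager
def partial_lock(locks, path):
    """Hold a lock on path for the duration of the with-block."""
    if path in locks:
        raise RuntimeError(f"{path} is already locked")
    locks.add(path)
    try:
        yield
    finally:
        locks.discard(path)


def move_device(device, src, dst, mapping, locks, redeploy):
    path = f"/devices/device[name={src.name}]/config/dRFS/device[name={device}]"
    with partial_lock(locks, path):           # acquire the partial lock
        config = src.extract_device(device)   # extract device and dRFS config
        dst.install_device(device, config)    # install on the target node
        src.delete_device(device)             # delete from the source node
        mapping[device] = dst.name            # update the mapping table
    # The lock is released here, before the services are re-deployed.
    redeploy(device)
```

Using a context manager for the lock ensures it is released even if one of the intermediate steps raises an error.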
With this infrastructure in place, it is fairly straightforward to implement actions for re-balancing devices among lower LSA nodes, as well as evacuating all devices from a given lower LSA node. The example contains implementations of those actions as well.
If we do not have the luxury of designing our NSO service application from scratch, but are instead faced with extending or changing an existing, already deployed application into the LSA architecture, we can use the techniques described in this section.
Usually, the reasons for rearchitecting an existing application are performance-related.
In the NSO example collection, two popular examples are the examples.ncs/service-management/mpls-vpn-java and examples.ncs/service-management/mpls-vpn-python examples. Those examples contain an almost "real" VPN provisioning example whereby VPNs are provisioned in a network of CPEs, PEs, and P routers according to this picture:
The service model in this example roughly looks like this:
There are several interesting observations on this model code related to the Layered Service Architecture.
Each instantiated service has a list of endpoints and CPE routers. These are modeled as a leafref into the /devices tree. This has to be changed if we wish to change this application into an LSA application since the /devices tree at the upper layer doesn't contain the actual managed routers. Instead, the /devices tree contains the lower layer RFS nodes.
There is no connectivity/topology information in the service model. Instead, the mpls-vpn example has topology information on the side, and that data is used by the provisioning code. That topology information, for example, contains data on which CE routers are directly connected to which PE router.
Remember from the previous section that one of the additional complications of an LSA application is the dispatching part. The dispatching problem fits well into the pattern where we have topology information stored on the side and let the provisioning FASTMAP code use that data to guide the provisioning. One straightforward way would be to augment the topology information with additional data, indicating which RFS node is used to manage a specific managed device.
By far the easiest way to change an existing monolithic NSO application into the LSA architecture is to keep the service model at the upper layer and lower layer almost identical, only changing things like leafrefs directly into the /devices tree, which obviously break.
In this example, the topology information is stored in a separate container share-data and propagated to the LSA nodes by means of service code.
The examples.ncs/layered-services-architecture/mpls-vpn-lsa example does exactly this. The upper-layer data model in upper-nso/packages/l3vpn/src/yang/l3vpn.yang now looks as follows:
The ce-device leaf is now just a regular string, not a leafref.
So, instead of an NSO topology that looks like:
We want an NSO architecture that looks like this:
The task for the upper-layer FASTMAP code is then to instantiate a copy of itself on the right lower-layer NSO nodes. The upper-layer FASTMAP code must:
Determine which routers (CE, PE, or P) will be touched by its execution.
Look up in its dispatch table which lower-layer NSO nodes host these routers.
Instantiate a copy of itself on those lower-layer NSO nodes. One extremely efficient way to do that is to use the Maapi.copyTree() method. The example contains code that looks like this:
Finally, we must make a minor modification to the lower-layer (RFS) provisioning code too. Originally, the FASTMAP code wrote all config for all routers participating in the VPN. Now, with the LSA partitioning, each lower-layer NSO node is only responsible for the portion of the VPN that involves devices that reside in its /devices tree. Thus, the provisioning code must be changed to ignore devices that do not reside in the /devices tree.
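The filtering on the lower layer amounts to a membership check against the local device list. The following Python sketch illustrates the idea only; it is not the example's provisioning code. It assumes each endpoint is a dict carrying a ce-device name, and the function name local_endpoints and the sample device names are hypothetical.

```python
def local_endpoints(endpoints, local_devices):
    """Keep only the endpoints whose CE device this node manages."""
    return [ep for ep in endpoints if ep["ce-device"] in local_devices]


endpoints = [
    {"id": "site-a", "ce-device": "ce0"},
    {"id": "site-b", "ce-device": "ce7"},
]
# Names of the devices present in this node's /devices tree (assumed).
print(local_endpoints(endpoints, {"ce0", "ce1", "pe0"}))
```

A complete implementation would apply the same check to every router the service touches, not only the CE devices.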
In addition to conceptual changes of splitting into upper- and lower-layer parts, migrating an existing monolithic application to LSA may also impact the models used. In the new design, the upper-layer node contains the (more or less original) CFS model as well as the device-compiled RFS model, which it requires for communication with the RFS nodes. In a typical scenario, these are two separate models. So, for example, they must each use a unique namespace.
To illustrate the different YANG files and namespaces used, the following text describes the process of splitting up an example monolithic service. Let's assume that the original service resides in a file, myserv.yang, and looks like the following:
In an LSA setting, we want to keep this module as close to the original as possible. We clearly want to keep the namespace, the prefix, and the structure of the YANG identical to the original, so as not to disturb any provisioning systems north of the original NSO. Thus, with only minor modifications, we want to run this module at the CFS node, with the non-applicable leafrefs removed. At the CFS node we would get:
Now, we want to run almost the same YANG module at the RFS node; however, the namespace must be changed. For the sake of the CFS node, we are going to NED-compile the RFS model, and NSO does not allow the same namespace to occur twice. Thus, for the RFS node, we would get a YANG module myserv-rfs.yang that looks like the following:
This file can, and should, keep the leafref as is.
The last file we get is the compiled NED, which should be loaded in the CFS node. The NED is directly compiled from the RFS model, as an LSA NED.
Thus, we end up with three distinct packages from the original one:
The original, slated for the CFS node, with leafrefs removed.
The modified original, slated for the RFS node, with the namespace and the prefix changed.
The NED, compiled from the RFS node code, slated for the CFS node.
The purpose of the upper CFS node is to manage all CFS services and to push the resulting service mappings to the RFS services. The lower RFS nodes are configured as devices in the device tree of the upper CFS node and the RFS services are created under the /devices/device/config accordingly. This is almost identical to the relation between a normal NSO node and the normal devices. However, there are differences when it comes to commit parameters and the commit queue, as well as some other LSA-specific features.
Such a design allows you to decide whether you will run the same version of NSO on all nodes or not. Since some differences arise between the two options, this document distinguishes a single-version deployment from a multi-version one.
Deployment of an LSA cluster where all the nodes have the same major version of NSO running is called a single version deployment. If the versions are different, then it is a multi-version deployment, since the packages on the CFS node must be managed differently.
The choice between the two deployment options depends on your functional needs. The single version is easier to maintain and is a good starting point but is less flexible. While it is possible to migrate from one to the other, the migration from a single version to a multi-version is typically easier than the other way around. Still, every migration requires some effort, so it is best to pick one approach and stick to it.
You can find working examples of both deployment types in the examples.ncs/layered-services-architecture/lsa-single-version-deployment and examples.ncs/layered-services-architecture/lsa-multi-version-deployment folders, respectively.
The type of deployment does not affect the RFS nodes. In general, the RFS nodes act very much like ordinary standalone NSO instances but only support the RFS services.
Configure and set up the lower RFS nodes as you would a standalone node, by making sure the necessary NED and RFS packages are loaded and the managed network devices added. This requires you to have already decided on the distribution of devices to lower RFS nodes. The RFS packages are ordinary service packages.
The only LSA-specific requirement is that these nodes enable NETCONF communication northbound, as this is how the upper CFS node will interact with them. To enable NETCONF northbound, ensure that a configuration similar to the following is present in the ncs.conf of every RFS node:
One thing to note is that you do not need to explicitly enable the commit queue on the RFS nodes, even if you intend to use LSA with the commit queue feature. The upper CFS node is aware of the LSA setup and will propagate the relevant commit flags to the lower RFS nodes automatically.
If you wish to enable the commit queue by default, that is, even for transactions originating on the RFS node (non-LSA), you are strongly encouraged to enable it globally, through the /devices/global-settings/commit-queue/enabled-by-default setting on all the RFS nodes and, importantly, the upper CFS node too. Otherwise, you may end up in a situation where only a part of the transaction runs through the commit queue. In that case, the rollback-on-error commit queue error option will not work correctly, as it can't roll back the full original transaction but just the part that went through the commit queue. This can result in an inconsistent network state.
Regardless of single or multi-version deployment, the upper CFS node has the lower RFS nodes configured as devices under the /devices/device tree. The CFS node communicates with these devices through NETCONF and must have the correct ned-id configured for each lower RFS node. The ned-id is set under /devices/device/device-type/netconf/ned-id, as for any NETCONF device.
The part that is specific to LSA is the actual ned-id used. This has to be ned:lsa-netconf or a ned-id derived from it. What is more, the ned-id depends on the deployment type. For a single-version deployment, you can use the lsa-netconf value directly. This ned-id is built-in (defined in tailf-ncs-ned.yang) and available in NSO without any additional packages.
So the configuration for the RFS device in the CFS node would look similar to:
Notice the use of the lsa-remote-node instead of the address (and port) as is usually done. This setting identifies the device as a lower-layer LSA node and instructs NSO to use connection information provided under cluster configuration.
The value of lsa-remote-node references a cluster remote-node, such as the following:
In addition to the one under devices device, an authgroup value is again required here; it refers to the cluster authgroup, not the device one. Both authgroups must be configured correctly for LSA to function.
Having added device and cluster configuration for all RFS nodes, you should update the SSH host keys for both the /devices/device and /cluster/remote-node paths. For example:
Moreover, the RFS NSO nodes have an extra configuration that may not be visible to the CFS node, resulting in out-of-sync behavior. You are strongly encouraged to set the out-of-sync-commit-behaviour value to accept, with a command such as:
At the same time, you should also enable the /cluster/device-notifications, which will allow the CFS node to receive the forwarded device notifications from the RFS nodes, and /cluster/commit-queue, to enable the commit queue support for LSA. Without the latter, you will not be able to use the commit commit-queue async command, for example.
If you wish to enable the commit queue by default, you should do so by setting the /devices/global-settings/commit-queue/enabled-by-default on the CFS node. Do not use per-device or per-device-group configuration, for the same reason you should avoid it on the RFS nodes.
If you plan a single-version deployment, the preceding steps are sufficient. For a multi-version deployment, on the other hand, there are two additional tasks to perform.
First, you will need to install the correct Cisco-NSO LSA NED package (or packages if you need to support more versions). Each NSO release includes these packages that are specifically tailored for LSA. They are used by the upper CFS node if the lower RFS nodes are running a different version than the CFS node itself. The packages are named cisco-nso-nc-X.Y
where X.Y are the two most significant numbers of the NSO release (the major version) that the package supports. So, if your RFS nodes are running NSO 5.7.2, for example, you should use cisco-nso-nc-5.7
.
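The naming rule can be illustrated with a small shell helper that derives the package name from the NSO version of the RFS node. The function `lsa_ned_package` is hypothetical and not part of NSO; it only demonstrates how the X.Y part is taken from the full version string.

```shell
# lsa_ned_package is a hypothetical helper, not an NSO tool.
# It keeps the two most significant numbers of an NSO version
# and prefixes them with "cisco-nso-nc-".
lsa_ned_package() {
    version="$1"             # full NSO version, e.g. 5.7.2
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"      # second number only
    printf 'cisco-nso-nc-%s.%s\n' "$major" "$minor"
}

lsa_ned_package 5.7.2   # prints cisco-nso-nc-5.7
```

The resulting name is both the package directory to look for under $NCS_DIR/packages/lsa and the ned-id to configure for the RFS device.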
These packages are found in the $NCS_DIR/packages/lsa
directory. Each package contains the complete model of the ncs
namespace for the corresponding NSO version, compiled as an LSA NED. Please always use the cisco-nso
package included with the NSO version of the upper CFS node and not some older variant (such as the one from the lower RFS node) as it may not work correctly.
Second, installing the cisco-nso LSA NED package will make the corresponding ned-id
available, such as cisco-nso-nc-5.7
(ned-id
matches the package name). Use this ned-id
for the RFS nodes instead of lsa-netconf
. For example:
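A sketch of such a change, assuming the RFS node is configured as the device `lower-nso-1` and runs NSO 5.7 (both are illustrative assumptions):

```
# Device name and version are illustrative assumptions
admin@upper-nso% set devices device lower-nso-1 device-type netconf ned-id cisco-nso-nc-5.7
admin@upper-nso% commit
```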
This configuration allows the CFS node to communicate with a different NSO version, but there are still some limitations. The upper CFS node must run the same or a newer version than the managed RFS nodes. For all the currently supported versions of the lower node, the packages can be found in the $NCS_DIR/packages/lsa
directory, but you may also be able to build an older one yourself.
In case you already have a single-version deployment using the lsa-netconf ned-ids, you can use the NED migrate procedure to switch to the new ned-id and multi-version deployment.
Besides adding managed lower-layer nodes, the upper-layer node also requires packages for the services. You must add the CFS package, which is an ordinary service package, to the CFS node. But you must also provide the device-compiled RFS YANG models to allow provisioning of RFSs on the remote RFS nodes.
The process resembles the way you create and compile device YANG models in normal NED packages. The ncs-make-package
tool provides the --lsa-netconf-ned
option, where you specify the location of the RFS YANG model and the tool creates a NED package for you. This is a new package that is separate from the RFS package used in the RFS nodes, so you might want to name it differently to avoid confusion. The following text uses the -ned
suffix.
Usually, you would also provide the --no-netsim
, --no-java
, and --no-python
switches to the invocation, as the package is used with the NETCONF protocol and doesn't need any additional code. The --no-netsim
option is required because netsim is not supported for these types of packages. For example:
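A sketch of such an invocation is shown below. The YANG directory and the package name `myrfs-service-ned` are placeholders, and the check for the tool on `PATH` is only there so the snippet fails gracefully outside an NSO shell environment.

```shell
# Placeholder values; adjust to your environment.
RFS_YANG_DIR=./path/to/rfs/src/yang   # directory holding the RFS service YANG model
NED_PACKAGE=myrfs-service-ned         # new NED package name, using the -ned suffix

# Run the tool only if it is available (for example, after sourcing ncsrc).
if command -v ncs-make-package >/dev/null 2>&1; then
    ncs-make-package --no-netsim --no-java --no-python \
        --lsa-netconf-ned "$RFS_YANG_DIR" \
        "$NED_PACKAGE"
else
    echo "ncs-make-package not found on PATH" >&2
fi
```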
In this case, there is no explicit --lsa-lower-nso
option specified and ncs-make-package
will by default be set up to compile the package for the single version deployment, tied to the lsa-netconf
ned-id
. That means the models in the NED can be used with devices that have an lsa-netconf
ned-id
configured.
To compile it for the multi-version deployment, which uses a different ned-id
, you must select the target NSO version with the --lsa-lower-nso cisco-nso-nc-X.Y
option, for example:
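A sketch of the multi-version variant follows, assuming the lower RFS nodes run NSO 5.7. The YANG directory and package name are placeholders, and the `PATH` check only keeps the snippet harmless outside an NSO shell environment.

```shell
# Placeholder values; adjust to your environment.
RFS_YANG_DIR=./path/to/rfs/src/yang   # directory holding the RFS service YANG model
NED_PACKAGE=myrfs-service-ned         # new NED package name
LOWER_NSO_NED=cisco-nso-nc-5.7        # ned-id matching the NSO version of the RFS nodes

# Run the tool only if it is available (for example, after sourcing ncsrc).
if command -v ncs-make-package >/dev/null 2>&1; then
    ncs-make-package --no-netsim --no-java --no-python \
        --lsa-netconf-ned "$RFS_YANG_DIR" \
        --lsa-lower-nso "$LOWER_NSO_NED" \
        "$NED_PACKAGE"
else
    echo "ncs-make-package not found on PATH" >&2
fi
```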
Depending on the RFS model, the package may fail to compile, even though the model compiles fine as a service. A typical error would indicate some node from a module, such as tailf-ncs
, is not found. The reason is that the original RFS service YANG model has dependencies on other YANG models that are not included in the compilation process.
One solution to this problem is to remove the dependencies in the YANG model before compilation. Normally this can be solved by changing the datatype in the NED compiled copy of the YANG model, for example from leafref
or instance-identifier
to string. This is only needed for the NED compiled copy, the lower RFS node YANG model can remain the same. There will then be an implicit conversion between types, at runtime, in the communication between the upper CFS node and the lower RFS node.
An alternate solution, if you are doing a single version deployment and there are dependencies on the tailf-ncs
namespace, is to switch to a multi-version deployment because the cisco-nso
package includes this namespace (device compiled). Here, the NSO versions match but you are still using the cisco-nso-nc-X.Y
ned-id
and have to follow the instructions for the multi-version deployment.
Once both the CFS and the device-compiled RFS service packages are ready, add them to the CFS node, then invoke a sync-from
action to complete the setup process.
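As a sketch, once the packages are copied or linked into the packages directory of the CFS node, the final steps could look like this in the CFS node CLI (the prompt is illustrative):

```
# Prompt is an illustrative assumption
admin@upper-nso> request packages reload
admin@upper-nso> request devices sync-from
```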
You can see all the required setup steps for a single-version deployment in the example examples.ncs/layered-services-architecture/lsa-single-version-deployment, while examples.ncs/layered-services-architecture/lsa-multi-version-deployment has the steps for the multi-version one. The two are quite similar, but the multi-version deployment has additional steps, so it is the one described here.
First, build the example for manual setup.
Then configure the nodes in the cluster. This is needed so that the upper CFS node can receive notifications from the lower RFS nodes and can be used with the commit queue.
To be able to handle the lower NSO node as an LSA node, the correct version of the cisco-nso-nc
package needs to be installed. In this example, 5.4 is used.
Create a link to the cisco-nso
package in the packages directory of the upper CFS node:
Reload the packages:
Now that the cisco-nso-nc
package is in place, configure the two lower NSO nodes and sync-from
them:
Now, for example, the configured devices of the lower nodes can be viewed:
Or, alarms inspected:
Now, create a NETCONF package on the upper CFS node that can be used with the rfs-vlan
service on the lower RFS node. In the shell terminal window, do the following:
The created NED is an lsa-netconf-ned
based on the YANG files of the rfs-vlan
service:
The version of the NED reflects the version of NSO on the lower node:
The package will be generated in the packages directory of the upper NSO CFS node:
And, the name of the package will be:
Install the cfs-vlan
service on the upper CFS node. In the shell terminal window, do the following:
Reload the packages once more to get the cfs-vlan
package. In the CLI terminal window, do the following:
Now that all packages are in place, a cfs-vlan
service can be configured. The cfs-vlan
service will dispatch service data to the right lower RFS node depending on the device names used in the service.
In the CLI terminal window, verify the service:
As ex0 resides on lower-nso-1, that part of the configuration goes there, and the ex5 part goes to lower-nso-2.
Since an LSA deployment consists of multiple NSO nodes (or HA pairs of nodes), each can be upgraded to a newer NSO version separately. While that offers a lot of flexibility, it also makes upgrades more complex in many cases. For example, performing a major version upgrade on the upper CFS node only will make the deployment Multi-Version even if it was Single-Version before the upgrade, requiring additional action on your part.
In general, staying with the Single-Version Deployment is the simplest option and does not require any further LSA-specific upgrade action (except perhaps recompiling the packages). However, the main downside is that, at least for a major upgrade, you must upgrade all the nodes at the same time (otherwise, you no longer have a Single-Version Deployment).
If that is not feasible, the solution is to run a Multi-Version Deployment. Along with all of the requirements, the section Multi-Version Deployment describes a major difference from the Single Version variant: the upper CFS node uses a version-specific cisco-nso-nc-X.Y
NED ID to refer to lower RFS nodes. That means, if you switch to a Multi-Version Deployment, or perform a major upgrade of the lower-layer RFS node, the ned-id
should change accordingly. However, do not change it directly but follow the correct NED upgrade procedure described in the section called NED Migration. Briefly, the procedure consists of these steps:
Keep the currently configured ned-id for an RFS device and the corresponding packages. If upgrading the CFS node, you will need to recompile the packages for the new NSO version.
Compile and load the packages that are device-compiled with the new ned-id
, alongside the old packages.
Use the migrate
action on a device to switch over to the new ned-id
.
The procedure requires you to have two versions of the device-compiled RFS service packages loaded in the upper CFS node when calling the migrate
action: one version compiled by referencing the old (current) NED ID and the other one by referencing the new (target) NED ID.
To illustrate, suppose you currently have an upper-layer and a lower-layer node both running NSO 5.4. The nodes were set up as described in the Single Version Deployment option, with the upper CFS node using the tailf-ncs-ned:lsa-netconf
NED ID for the lower-layer RFS node. The CFS node also uses the rfs-vlan-ned
NED package for the rfs-vlan
service.
Now you wish to upgrade the CFS node to NSO 5.7 but keep the RFS node on the existing version 5.4. Before upgrading the CFS node, you create a backup and recompile the rfs-vlan-ned
package for NSO 5.7. Note that the package references the lsa-netconf
ned-id
, which is the ned-id
configured for the RFS device in the CFS node's CDB. Then, you perform the CFS node upgrade as usual.
At this point the CFS node is running the new, 5.7 version and the RFS node is running 5.4. Since you now have a Multi Version Deployment, you should migrate to the correct ned-id
as well. Therefore, you prepare the rfs-vlan-nc-5.4
package, as described in the Multi-Version Deployment option, compile the package, and load it into the CFS node. Thanks to the NSO CDM feature, both packages, rfs-vlan-nc-5.4
and rfs-vlan-ned
, can be used at the same time.
With the packages ready, you execute the devices device lower-nso-1 migrate new-ned-id cisco-nso-nc-5.4
command on the CFS node. The command configures the RFS device entry on CFS to use the new cisco-nso-nc-5.4 ned-id
, as well as migrates the device configuration and service meta-data to the new model. Having completed the upgrade, you can now remove the rfs-vlan-ned
if you wish.
Later on, you may decide to upgrade the RFS node to NSO 5.6. Again, you prepare the new rfs-vlan-nc-5.6
package for the CFS node in a similar way as before, now using the cisco-nso-nc-5.6
ned-id instead of cisco-nso-nc-5.4
. Next, you perform the RFS node upgrade to 5.6 and finally migrate the RFS device on the CFS node to the cisco-nso-nc-5.6 ned-id
, with the migrate action.
Likewise, you can return to the Single Version Deployment by upgrading the RFS node to NSO 5.7, reusing the old rfs-vlan-ned package or preparing it anew, and migrating to the lsa-netconf ned-id.
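The ned-id changes in this walkthrough correspond to the following migrate invocations on the CFS node, run one at a time at the respective stage and only after the matching device-compiled package has been loaded. The prompt is illustrative; the device name and ned-ids are the ones used in the scenario above.

```
# Stage 1: CFS upgraded to 5.7, RFS still on 5.4
admin@upper-nso> request devices device lower-nso-1 migrate new-ned-id cisco-nso-nc-5.4
# Stage 2: RFS upgraded to 5.6
admin@upper-nso> request devices device lower-nso-1 migrate new-ned-id cisco-nso-nc-5.6
# Stage 3: RFS upgraded to 5.7, back to Single Version Deployment
admin@upper-nso> request devices device lower-nso-1 migrate new-ned-id lsa-netconf
```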
All these ned-id
changes stem from the fact that the upper-layer CFS node treats the lower-layer RFS node as a managed device, requiring the correct model, just like it does for any other device type. For the same reason, maintenance (bug fix or patch) NSO upgrades do not result in a changed ned-id
, so for those no migration is necessary.
In LSA, northbound users are authenticated on the CFS, and the request is re-authenticated on the RFS using either a system user or user/pass passthrough.
For token-based authentication using external auth/package auth, this becomes a problem as the user and password are not expected to be locally provisioned and hence cannot be used for authentication towards the RFS, which leaves the option of a system user.
Using a system user has two major limitations:
Auditing on the RFS becomes hard, as system sessions are not logged in the audit.log
.
Device-level RBAC becomes challenging as the devices reside in the RFS and the user information is lost.
To handle this scenario, one can enable the passthrough of the user name and its groups to lower layer nodes to allow the session on the RFS to assume the same user as used on the CFS (similar to use of "sudo"). This will allow for the use of a system user between the CFS and RFS while allowing for auditing and RBAC on the RFS using the locally authenticated user on the CFS.
On the CFS node, create an authgroup under /devices/authgroups/group
with the /devices/authgroups/group/{umap,default-map}/passthrough
empty leaf set, then select this authgroup on the configured RFS nodes by setting the /devices/device/authgroup
leaf. When the passthrough leaf is set and a user (e.g., alice) on the CFS node connects to an RFS node, she will authenticate using the credentials specified in the /devices/device/authgroup
authgroup (e.g., lsa_passthrough_user
: ahVaesai8Ahn0AiW
). Once the authentication completes successfully, the user lsa_passthrough_user
changes into alice on the RFS node.
On the RFS node, configure the mapping of permitted users in the /cluster/global-settings/passthrough/permit list. The key of the permit list specifies which user may change into a different user. The as-user leaf-list specifies the users that this user may change into, and the as-group leaf-list specifies the valid groups. The resulting session gets the intersection of the groups in the user session on the CFS and the groups specified by the as-group leaf-list. Only users in the permit list are allowed to change user, and only into the users listed in the as-user leaf-list of their permit list entry.
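The two sides of this setup can be sketched as follows. The authgroup name `lsa-passthrough`, the device name `lower-nso-1`, the group name `oper`, and the prompts are hypothetical, and the commands assume the authgroup uses the usual `remote-name` and `remote-password` leaves next to the `passthrough` leaf.

On the CFS node:

```
# Authgroup, device name, and prompt are hypothetical
admin@upper-nso% set devices authgroups group lsa-passthrough default-map remote-name lsa_passthrough_user
admin@upper-nso% set devices authgroups group lsa-passthrough default-map remote-password ahVaesai8Ahn0AiW
admin@upper-nso% set devices authgroups group lsa-passthrough default-map passthrough
admin@upper-nso% set devices device lower-nso-1 authgroup lsa-passthrough
admin@upper-nso% commit
```

On the RFS node:

```
# Group name "oper" and prompt are hypothetical
admin@lower-nso-1% set cluster global-settings passthrough permit lsa_passthrough_user as-user [ alice ] as-group [ oper ]
admin@lower-nso-1% commit
```

With this configuration, if alice belongs to the groups `oper` and `admin` on the CFS node, her session on the RFS node would carry only `oper`, since that is the intersection with the as-group leaf-list.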