redis/src/bio.h

/*
* Copyright (c) 2009-Present, Redis Ltd.
* All rights reserved.
*
* Licensed under your choice of the Redis Source Available License 2.0
* (RSALv2) or the Server Side Public License v1 (SSPLv1).
*/
#ifndef __BIO_H
#define __BIO_H

#include <stdint.h> /* uint64_t, used by comp_fn below */

typedef void lazy_free_fn(void *args[]);
Change FLUSHALL/FLUSHDB SYNC to run as blocking ASYNC (#13167)

# Overview

Users run `FLUSHDB SYNC` and `FLUSHALL SYNC` for a variety of reasons. The main issue with these commands is that if the database has grown large, the server stays unresponsive for an extended period. Besides freezing application traffic, this may lead some clients to make incorrect judgments about the server's availability; for instance, a watchdog may erroneously decide to terminate the process, with potentially adverse outcomes. While `FLUSH* ASYNC` can address these issues, it may not be used for two reasons: first, it is not the default, and second, in some cases the client issuing the flush wants to wait for its completion before repopulating the database.

Between triggering FLUSH* asynchronously in the background with no completion indication, and running it synchronously in the foreground on the main thread, there is a more appealing third option: block the client that requested the flush, execute the flush in the background, and once it is done, unblock the client and return a completion notification. This keeps the server responsive to other clients, while the blocked client receives the expected response only after the flush operation has actually been carried out.

# Implementation details

Instead of defining yet another flavor of the flush command, `FLUSHALL SYNC` and `FLUSHDB SYNC` are modified to always run in this new mode.

## Extending BIO threads' capabilities

Today, jobs carried out by BIO threads have no way to signal completion to the main thread. This infrastructure is added through an additional dummy job, coined a completion-job, which is eventually written by the BIO threads to a response queue. The main thread consumes items from the response queue and calls the callback function provided with each completion-job.

## FLUSH* SYNC to run as blocking ASYNC

`FLUSH* SYNC` is modified to create one or more async jobs to flush the DB(s) and afterwards push an additional completion-job request. Because the completion-job request is sent only at the end, the main thread is called back only after all the preceding jobs have completed their work in the background. During that time the client that issued the command is suspended and marked as `BLOCKED_LAZYFREE`, while any other client can communicate with the server without any issue.
typedef void comp_fn(uint64_t user_data);
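To make the completion-job flow described above concrete, here is a minimal sketch (not the actual server code; `freePayloadAsync`, `flushSyncDone`, `flushInBackground`, and `unblockFlushClient` are hypothetical names): work is queued on the lazy-free worker, followed by a completion request on the same worker, so the callback fires on the main thread only after everything queued before it has finished.

```c
/* Hypothetical lazy_free_fn: frees a single allocation handed over by the caller. */
static void freePayloadAsync(void *args[]) {
    zfree(args[0]);                /* the zmalloc.h allocator used across the codebase */
}

/* comp_fn called back on the main thread once the completion request is serviced. */
static void flushSyncDone(uint64_t client_id) {
    unblockFlushClient(client_id); /* hypothetical helper: release the BLOCKED_LAZYFREE client */
}

static void flushInBackground(void *payload, uint64_t client_id) {
    /* Queue the actual work on the lazy-free worker... */
    bioCreateLazyFreeJob(freePayloadAsync, 1, payload);
    /* ...then a completion request on the same worker: it sits behind every job
     * queued before it, so flushSyncDone() runs only once the flush is done. */
    bioCreateCompRq(BIO_WORKER_LAZY_FREE, flushSyncDone, client_id);
}
```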
typedef enum bio_worker_t {
    BIO_WORKER_CLOSE_FILE = 0,
    BIO_WORKER_AOF_FSYNC,
    BIO_WORKER_LAZY_FREE,
    BIO_WORKER_NUM
} bio_worker_t;
/* Background job opcodes */
typedef enum bio_job_type_t {
    BIO_CLOSE_FILE = 0,     /* Deferred close(2) syscall. */
    BIO_AOF_FSYNC,          /* Deferred AOF fsync. */
    BIO_LAZY_FREE,          /* Deferred objects freeing. */
    BIO_CLOSE_AOF,          /* Deferred close of the AOF file; also updates the fsynced offset. */
    BIO_COMP_RQ_CLOSE_FILE, /* Job completion request, registered on the close-file worker's queue. */
    BIO_COMP_RQ_AOF_FSYNC,  /* Job completion request, registered on the aof-fsync worker's queue. */
    BIO_COMP_RQ_LAZY_FREE,  /* Job completion request, registered on the lazy-free worker's queue. */
    BIO_NUM_OPS
} bio_job_type_t;
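The comments above pin each completion request to a specific worker's queue. One plausible way to express the job-type-to-worker assignment (a sketch with an assumed table name, not necessarily how bio.c lays it out) is a per-job-type lookup table; keeping BIO_CLOSE_AOF on the aof-fsync worker preserves ordering between AOF jobs, as the WAITAOF change quoted further down explains.

```c
/* Sketch: which worker services each job type. Completion requests ride on the
 * same queue as the jobs whose completion they report. */
static const bio_worker_t job_to_worker[BIO_NUM_OPS] = {
    [BIO_CLOSE_FILE]         = BIO_WORKER_CLOSE_FILE,
    [BIO_AOF_FSYNC]          = BIO_WORKER_AOF_FSYNC,
    [BIO_LAZY_FREE]          = BIO_WORKER_LAZY_FREE,
    [BIO_CLOSE_AOF]          = BIO_WORKER_AOF_FSYNC, /* same worker as the fsync jobs, for ordering */
    [BIO_COMP_RQ_CLOSE_FILE] = BIO_WORKER_CLOSE_FILE,
    [BIO_COMP_RQ_AOF_FSYNC]  = BIO_WORKER_AOF_FSYNC,
    [BIO_COMP_RQ_LAZY_FREE]  = BIO_WORKER_LAZY_FREE,
};
```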
/* Exported API */
void bioInit(void);
unsigned long bioPendingJobsOfType(int type);
Implementing the WAITAOF command (issue #10505) (#11713)

WAITAOF allows the user to block until a specified number of Redis instances have fsynced all previous write commands to the AOF.

Syntax: `WAITAOF <num_local> <num_replicas> <timeout>`

Response: an array containing two elements, num_local and num_replicas. num_local is always either 0 or 1, representing the local AOF on the master. num_replicas is the number of replicas that acknowledged fsyncing the replication offset of the last write to their AOF. The command returns an error when called on replicas, or when called with a non-zero num_local on a master with AOF disabled; in all other cases the response just contains the number of fsynced copies.

Main changes:
* Added code to keep track of replication offsets that are confirmed to have been fsynced to disk.
* Keep advancing master_repl_offset even when replication is disabled (and there is no replication backlog), as long as an AOF is enabled. This way the command and its mechanisms can be used even when replication is disabled.
* Extend REPLCONF ACK to `REPLCONF ACK <ofs> FACK <ofs>`; the FACK is appended only if there is an AOF on the replica, and it is ignored by old masters (thus backwards compatible).
* WAIT no longer waits for the replication offset after your last command, but rather for the replication offset after your last write (or a read command that caused propagation, e.g. lazy expiry).

Unrelated changes:
* The WAIT command respects CLIENT_DENY_BLOCKING (not just CLIENT_MULTI).

Implementation details:
* Add an atomic variable named `fsynced_reploff_pending` that is updated (usually by the bio thread) and later copied to the main `fsynced_reploff` variable (only if the AOF base file exists), i.e. during the initial AOF rewrite it is not used as the fsynced offset, since the AOF base is still missing.
* Replace the close+fsync bio job with a new BIO_CLOSE_AOF (AOF-specific) job that also updates the fsynced offset field.
* Handle all AOF jobs (BIO_CLOSE_AOF, BIO_AOF_FSYNC) in the same bio worker thread, to impose ordering on their execution. This solves a race condition where a job could set `fsynced_reploff_pending` to a higher value than another pending fsync job, indicating an offset for which parts of the data have not yet actually been fsynced. Imposing an ordering on the jobs guarantees that fsync jobs are executed in increasing order of replication offset.
* Drain bio jobs when switching `appendfsync` to "always". This prevents a write race between updates to `fsynced_reploff_pending` in the main thread (`flushAppendOnlyFile` when set to ALWAYS fsync) and those done in the bio thread.
* Drain the pending fsync when starting over a new AOF, to avoid race conditions with the previous AOF offsets overriding the new one (e.g. after switching to replicate from a new master).
* Make sure to update the fsynced offset at the end of the initial AOF rewrite; this is a must in case there are no additional writes that trigger a periodic fsync, specifically for a replica that does a full sync.

Limitations: it is possible to write a module or a Lua script that propagates to the AOF but does not propagate to the replication stream (see REDISMODULE_ARGV_NO_REPLICAS and luaRedisSetReplCommand). These features are incompatible with the WAITAOF command and can result in two bad cases. The scenario is that the user executes a command that only propagates to the AOF, then immediately issues a WAITAOF, and there are no further writes on the replication stream after that:
1. If the last thing that happened on the replication stream is a PING (which increased the replication offset but will not trigger an fsync on the replica), the client would hang forever, waiting for a FACK that the replica will never send since it does not trigger any fsyncs.
2. If the last thing that happened is a write command that got propagated properly, WAITAOF is released immediately, without waiting for an fsync (since the offset did not change).

Refactoring:
* Plumbing to allow a bio worker to handle multiple job types. This introduces the infrastructure necessary to allow BIO workers not to have a 1-1 mapping of worker to job type, so that in the future multiple job types can be assigned to a single worker, either as a performance/resource optimization or as a way of enforcing ordering between specific classes of jobs.

Co-authored-by: Oran Agra <oran@redislabs.com>
void bioDrainWorker(int job_type);
void bioKillThreads(void);
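A small sketch of the fsync-offset flow the WAITAOF commit message above describes (`queueAofFsync`, `switchToAlwaysFsync`, `aof_fd`, and `covered_repl_offset` are placeholder names, not the server's actual fields): the main thread queues the fsync together with the replication offset already written to the AOF; once the fsync(2) completes, the bio worker publishes that offset for WAITAOF to compare against each blocked client's target, and draining the worker before switching `appendfsync` to "always" keeps foreground and background fsyncs from racing.

```c
/* Sketch only: queue a background fsync of the AOF, tagging it with the
 * replication offset covered by the bytes already written to the file. */
static void queueAofFsync(int aof_fd, long long covered_repl_offset) {
    bioCreateFsyncJob(aof_fd, covered_repl_offset, 0 /* need_reclaim_cache */);
}

/* Sketch only: before fsyncing from the main thread on every write, wait until
 * the aof-fsync worker's queue is empty so the two sides cannot race on the
 * fsynced-offset bookkeeping. */
static void switchToAlwaysFsync(void) {
    bioDrainWorker(BIO_AOF_FSYNC);
}
```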
Reclaim page cache of RDB file (#11248)

# Background

The RDB file is usually generated, used once, and seldom used again, but its content stays in the page cache until the OS evicts it. A potential problem is that once free memory is exhausted, the OS has to reclaim memory from the page cache or swap anonymous pages out, which may cause jitter in the Redis service. Consider a concrete scenario: a high-capacity machine hosts many Redis instances, and we upgrade them together. The page cache on the host machine grows as RDBs are generated. Once free memory drops below the low watermark (which is more likely on older Linux kernels such as 3.10: before [watermark_scale_factor](https://lore.kernel.org/lkml/1455813719-2395-1-git-send-email-hannes@cmpxchg.org/) was introduced, the low watermark was linear in the min watermark, leaving little buffer space for `kswapd` to wake up and reclaim memory), a direct reclaim happens, which means the process stalls waiting for memory allocation.

# What the PR does

The PR introduces the capability to reclaim the cache when the RDB is operated on. There are generally two cases, reading and writing the RDB. For reads it is a little messy to reclaim incrementally, so the reclaim is done in one go in the background after the load finishes, to avoid blocking the worker thread. For writes, incremental reclaim amortizes the work, so there is no need to push it into the background, and the peak cache watermark is reduced this way. Two cases are handled specially, replication and restart: in both, the cache is leveraged to speed up processing, so the reclaim is postponed to the right time. To do this, a flag is added to `rdbSave` and `rdbLoad` to control whether the cache needs to be kept, with a default value of false.

# Something worth noting

1. Although `posix_fadvise` is a POSIX standard, only a few platforms support it, e.g. Linux and FreeBSD 10.0.
2. On Linux, `posix_fadvise` only takes effect on written-back pages, so a `sync` (or `fsync`, `fdatasync`) is needed to flush dirty pages before calling `posix_fadvise` if we reclaim the write cache.

# About testing

A unit test is added to verify the effect of `posix_fadvise`. In the integration tests the overall cache increase is checked, as well as the cache backed by the RDB, via a specific Tcl test executed in an isolated GitHub Actions job.
void bioCreateCloseJob(int fd, int need_fsync, int need_reclaim_cache);
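As a rough illustration of the reclaim described above (a sketch under assumptions, not the actual rio/bio implementation; `reclaimFileCache`, `closeRdbAndReclaim`, and `rdb_fd` are made-up names), the caller hands the RDB's fd to the close-file worker with both flags set, and the reclaim itself boils down to an fsync followed by `posix_fadvise(POSIX_FADV_DONTNEED)`:

```c
#include <fcntl.h>  /* posix_fadvise, POSIX_FADV_DONTNEED */
#include <unistd.h> /* fsync */

/* Sketch of the reclaim step: dirty pages must be written back first, otherwise
 * POSIX_FADV_DONTNEED has no effect on them (note 2 in the commit message above). */
static void reclaimFileCache(int fd) {
#if defined(__linux__)
    if (fsync(fd) == 0)
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    (void) fd; /* posix_fadvise is not supported on every platform */
#endif
}

/* Caller side: the freshly written RDB's cached pages are no longer needed, so
 * ask the background close job to fsync and then drop them. */
static void closeRdbAndReclaim(int rdb_fd) {
    bioCreateCloseJob(rdb_fd, 1 /* need_fsync */, 1 /* need_reclaim_cache */);
}
```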
void bioCreateCloseAofJob(int fd, long long offset, int need_reclaim_cache);
void bioCreateFsyncJob(int fd, long long offset, int need_reclaim_cache);
void bioCreateLazyFreeJob(lazy_free_fn free_fn, int arg_count, ...);
void bioCreateCompRq(bio_worker_t assigned_worker, comp_fn *func, uint64_t user_data);
#endif