write. This way on first connection we will synchronize only the extents that
were modified during the lifetime of primary node, not entire GEOM provider.
MFC after: 3 days
reordering won't make the actual write to be committed before marking
the coresponding extent as dirty.
It can be disabled in configuration file.
If BIO_FLUSH is not supported by the underlying file system we log a warning
and never send BIO_FLUSH again to that GEOM provider.
MFC after: 3 days
disk if needed. This should fix a potential case when extents are cleared in
activemap but metadata is not updated on disk.
Suggested by: pjd
Approved by: pjd (mentor)
stating if we need to update activemap on disk. This makes keepdirty
serve its purpose -- to reduce number of metadata updates.
Discussed with: pjd
Approved by: pjd (mentor)
only receiving the data. In r220271 the unused directions were
disabled using shutdown(2).
Unfortunately, this broke automatic receive buffer sizing, which
currently works only for connections in ETASBLISHED state. It was a
root cause of the issue reported by users, when connection between
primary and secondary could get stuck.
Disable the code introduced in r220271 until the issue with automatic
buffer sizing is not resolved.
Reported by: Daniel Kalchev <daniel@digsys.bg>, danger, sobomax
Tested by: Daniel Kalchev <daniel@digsys.bg>, danger
Approved by: pjd (mentor)
MFC after: 1 week
sending. What happens otherwise is that the sender splits all the
traffic into 32k chunks, while the receiver is waiting for the whole
packet. Then for a certain packet sizes, particularly 66607 bytes in
my case, the communication stucks to secondary is expecting to
read one chunk of 66607 bytes, while primary is sending two chunks
of 32768 bytes and third chunk of 1071. Probably due to TCP windowing
and buffering the final chunk gets stuck somewhere, so neither server
not client can make any progress.
This patch also protect from short reads, as according to the manual
page there are some cases when MSG_WAITALL can give less data than
expected.
MFC after: 3 days
requests as well as number of activemap updates.
Number of BIO_WRITEs and activemap updates are especially interesting, because
if those two are too close to each other, it means that your workload needs
bigger number of dirty extents. Activemap should be updated as rarely as
possible.
MFC after: 1 week
because we need to do ioctl(2)s, which are not permitted in the capability
mode. What we do now is to chroot(2) to /var/empty, which restricts access
to file system name space and we drop privileges to hast user and hast
group.
This still allows to access to other name spaces, like list of processes,
network and sysvipc.
To address that, use jail(2) instead of chroot(2). Using jail(2) will restrict
access to process table, network (we use ip-less jails) and sysvipc (if
security.jail.sysvipc_allowed is turned off). This provides much better
separation.
MFC after: 1 week
hastd process and workers, remove unused one and set different range
of numbers. This is done in order not to confuse them with HASTCTL_CMD
defines, used for conversation between hastctl and hastd, and to avoid
bugs like the one fixed in in r221075.
Approved by: pjd (mentor)
MFC after: 1 week
secondary role. It is possible that the remote node is primary, but only
because there was a role change and it didn't finish cleaning up (unmounting
file systems, etc.). If we detect such situation, wait for the remote node
to switch the role to secondary before accepting I/Os. If we don't wait for
it in that case, we will most likely cause split-brain.
MFC after: 1 week
- We have two nodes connected and synchronized (local counters on both sides
are 0).
- We take secondary down and recreate it.
- Primary connects to it and starts synchronization (but local counters are
still 0).
- We switch the roles.
- Synchronization restarts but data is synchronized now from new primary
(because local counters are 0) that doesn't have new data yet.
This fix this issue we bump local counter on primary when we discover that
connected secondary was recreated and has no data yet.
Reported by: trociny
Discussed with: trociny
Tested by: trociny
MFC after: 1 week
hast_proto_recv_hdr() may be used. This also fixes the issue
(introduced by r220523) with hastctl, which crashed on assert in
hast_proto_recv_data().
Suggested and approved by: pjd (mentor)
this means that request timed out. Translate the meaningless EAGAIN to
ETIMEDOUT to give administrator a hint that he might need to increase timeout
in configuration file.
MFC after: 1 month
Before it could change later and we were sending invalid mapsize.
Some time ago I added optimization where when nodes are connected for the
first time and there were no writes to them yet, there is no initial full
synchronization. This bug prevented it from working.
MFC after: 1 week
equal to secondary counters:
primary_localcnt = secondary_remotecnt
primary_remotecnt = secondary_localcnt
Previously it was done wrong and split-brain was observed after
primary had synchronized up-to-date data from secondary.
Approved by: pjd (mentor)
MFC after: 1 week
We can use capsicum for secondary worker processes and hastctl.
When working as primary we drop privileges using chroot+setgid+setuid
still as we need to send ioctl(2)s to ggate device, for which capsicum
doesn't allow (yet).
X-MFC after: capsicum is merged to stable/8
our info about worker processes if any of them was terminated in the meantime.
This fixes the problem with 'hastctl status' running from a hook called on
split-brain:
1. Secondary calls a hooks and terminates.
2. Hook asks for resource status via 'hastctl status'.
3. The main hastd handles the status request by sending it to the secondary
worker who is already dead, but because signals weren't checked yet he
doesn't know that and we get EPIPE.
MFC after: 1 week
This way we know how to connect to secondary node when we are primary.
The same variable is used by the secondary node - it only accepts
connections from the address stored in 'remote' variable.
In cluster configurations it is common that each node has its individual
IP address and there is one addtional shared IP address which is assigned
to primary node. It seems it is possible that if the shared IP address is
from the same network as the individual IP address it might be choosen by
the kernel as a source address for connection with the secondary node.
Such connection will be rejected by secondary, as it doesn't come from
primary node individual IP.
Add 'source' variable that allows to specify source IP address we want to
bind to before connecting to the secondary node.
MFC after: 1 week
connection so the worker will exit if it does not receive packets from
the primary during this interval.
Reported by: Christian Vogt <Christian.Vogt@haw-hamburg.de>
Tested by: Christian Vogt <Christian.Vogt@haw-hamburg.de>
Approved by: pjd (mentor)
MFC after: 1 week
- Load support for %T for pritning time.
- Add support for %N for printing number in human readable form.
- Add support for %S for printing sockaddr structure (currently only AF_INET
family is supported, as this is all we need in HAST).
- Disable gcc compile-time format checking as this will no longer work.
MFC after: 2 weeks
- HOLE - it simply turns all-zero blocks into few bytes header;
it is extremely fast, so it is turned on by default;
it is mostly intended to speed up initial synchronization
where we expect many zeros;
- LZF - very fast algorithm by Marc Alexander Lehmann, which shows
very decent compression ratio and has BSD license.
MFC after: 2 weeks
1. The descriptor is the one we are listening on (not the one when we connect
as a client and not the one which is created on accept(2)).
2. Descriptor was created by us (PID matches with the PID stored on bind(2)).
Reported by: Mikolaj Golub <to.my.trociny@gmail.com>
MFC after: 1 week
worker can ask the main privileged process to connect in worker's behalf
and then we can migrate descriptor using this socketpair to worker.
This is not really needed now, but will be needed once we start to use
capsicum for sandboxing.
MFC after: 1 week
proto_connection_{send,recv} and change them to return proto_conn
structure. We don't operate directly on descriptors, but on
proto_conns.
- Add wrap method to wrap descriptor with proto_conn.
- Remove methods to send and receive descriptors and implement this
functionality as additional argument to send and receive methods.
MFC after: 1 week
If timeout argument to proto_connect() is -1, then the caller needs to use
this new function to wait for connection.
This change is in preparation for capsicum, where sandboxed worker wants
to ask main process to connect in worker's behalf and pass descriptor
to the worker. Because we don't want the main process to wait for the
connection, it will start async connection and pass descriptor to the
worker who will be responsible for waiting for the connection to finish.
MFC after: 1 week
to syslog if we run in background.
- Asserts in proto.c that method we want to call is implemented and remove
dummy methods from protocols implementation that are only there to abort
the program with nice message.
MFC after: 1 week
Accepting connections and handshaking in secondary is still done before
dropping privileges. It should be implemented by only accepting connections in
privileged main process and passing connection descriptors to the worker, but
is not implemented yet.
MFC after: 1 week
- chrooting to /var/empty (user hast home directory),
- setting groups to 'hast' (user hast primary group),
- setting real group id, effective group id and saved group id to 'hast',
- setting real user id, effective user id and saved user id to 'hast'.
At the end verify that those operations where successfull.
MFC after: 1 week
we expect to be open. Also assert that they point at expected type.
Because openlog(3) API is unable to tell us descriptor number it is using, we
have to close syslog socket, remember assert message in local buffer and if we
fail on assertion, reopen syslog socket and log the message.
MFC after: 1 week
PJDLOG_RVERIFY() - always check expression and on false log the given message
and exit.
PJDLOG_RASSERT() - check expression when NDEBUG is not defined and on false log
given message and exit.
PJDLOG_ABORT() - log the given message and exit.
MFC after: 1 week
master process only and pass changes to the worker processes over control
socket. This removes access to global namespace in preparation for capsicum
sandboxing.
MFC after: 2 weeks