Commit Graph

88 Commits

Author SHA1 Message Date
pjd
14cf798458 Implement 'async' mode for HAST.
MFC after:	3 days
2011-10-27 20:32:57 +00:00
pjd
c017e98c55 Minor cleanups.
MFC after:	3 days
2011-10-27 20:15:37 +00:00
pjd
739f931164 Reduce indentation.
MFC after:	3 days
2011-10-27 20:13:39 +00:00
pjd
986d3757ae Improve comment so it doesn't suggest race is possible, but that we handle
the race.

MFC after:	3 days
2011-10-27 20:10:21 +00:00
pjd
fcce680c47 Monor cleanups.
MFC after:	3 days
2011-10-27 18:49:16 +00:00
pjd
c2e715af70 Delay resuid generation until first connection to secondary, not until first
write. This way on first connection we will synchronize only the extents that
were modified during the lifetime of primary node, not entire GEOM provider.

MFC after:	3 days
2011-10-27 18:45:01 +00:00
pjd
ae2bc06327 Correct typo.
MFC after:	3 days
2011-09-28 13:25:27 +00:00
pjd
25b3d91c28 If the underlying provider doesn't support BIO_FLUSH, log it only once
and don't bother trying in the future.

MFC after:	3 days
2011-09-28 13:19:47 +00:00
pjd
374501b495 After every activemap change flush disk's write cache, so that write
reordering won't make the actual write to be committed before marking
the coresponding extent as dirty.

It can be disabled in configuration file.

If BIO_FLUSH is not supported by the underlying file system we log a warning
and never send BIO_FLUSH again to that GEOM provider.

MFC after:	3 days
2011-09-28 13:08:51 +00:00
pjd
1f93bdc27f No need to wrap pjdlog functions around with KEEP_ERRNO() macro.
MFC after:	3 days
2011-09-27 08:26:09 +00:00
pjd
b294ac14be Correct two mistakes when converting asserts to PJDLOG_ASSERT()/PJDLOG_ABORT().
MFC after:	3 days
2011-09-27 07:59:10 +00:00
trociny
ec6755c2ee Fix indentation.
Approved by:	pjd (mentor)
2011-07-13 05:32:55 +00:00
trociny
82faa3e641 Check the returned value of activemap_write_complete() and update matadata on
disk if needed. This should fix a potential case when extents are cleared in
activemap but metadata is not updated on disk.

Suggested by:	pjd
Approved by:	pjd (mentor)
2011-06-28 21:01:32 +00:00
trociny
a262e24ff4 In HAST we use two sockets - one for only sending the data and one for
only receiving the data. In r220271 the unused directions were
disabled using shutdown(2).

Unfortunately, this broke automatic receive buffer sizing, which
currently works only for connections in ETASBLISHED state. It was a
root cause of the issue reported by users, when connection between
primary and secondary could get stuck.

Disable the code introduced in r220271 until the issue with automatic
buffer sizing is not resolved.

Reported by:	Daniel Kalchev <daniel@digsys.bg>, danger, sobomax
Tested by:	Daniel Kalchev <daniel@digsys.bg>, danger
Approved by:	pjd (mentor)
MFC after:	1 week
2011-06-17 07:07:26 +00:00
trociny
e74ba5594f If READ from the local node failed we send the request to the remote
node. There is no use in doing this for synchronization requests.

Approved by:	pjd (mentor)
MFC after:	1 week
2011-05-29 21:20:47 +00:00
pjd
42a14e17b5 Keep statistics on number of BIO_READ, BIO_WRITE, BIO_DELETE and BIO_FLUSH
requests as well as number of activemap updates.

Number of BIO_WRITEs and activemap updates are especially interesting, because
if those two are too close to each other, it means that your workload needs
bigger number of dirty extents. Activemap should be updated as rarely as
possible.

MFC after:	1 week
2011-05-23 21:15:19 +00:00
pjd
eccd4beb31 Currently we are unable to use capsicum for the primary worker process,
because we need to do ioctl(2)s, which are not permitted in the capability
mode. What we do now is to chroot(2) to /var/empty, which restricts access
to file system name space and we drop privileges to hast user and hast
group.

This still allows to access to other name spaces, like list of processes,
network and sysvipc.

To address that, use jail(2) instead of chroot(2). Using jail(2) will restrict
access to process table, network (we use ip-less jails) and sysvipc (if
security.jail.sysvipc_allowed is turned off). This provides much better
separation.

MFC after:	1 week
2011-05-14 17:02:03 +00:00
pjd
e272ff3c6f When we become primary, we connect to the remote and expect it to be in
secondary role. It is possible that the remote node is primary, but only
because there was a role change and it didn't finish cleaning up (unmounting
file systems, etc.). If we detect such situation, wait for the remote node
to switch the role to secondary before accepting I/Os. If we don't wait for
it in that case, we will most likely cause split-brain.

MFC after:	1 week
2011-04-20 18:43:28 +00:00
pjd
c3d7e4b40e Scenario:
- We have two nodes connected and synchronized (local counters on both sides
  are 0).
- We take secondary down and recreate it.
- Primary connects to it and starts synchronization (but local counters are
  still 0).
- We switch the roles.
- Synchronization restarts but data is synchronized now from new primary
  (because local counters are 0) that doesn't have new data yet.

This fix this issue we bump local counter on primary when we discover that
connected secondary was recreated and has no data yet.

Reported by:	trociny
Discussed with:	trociny
Tested by:	trociny
MFC after:	1 week
2011-04-19 19:26:27 +00:00
pjd
52d273ec99 Declare directions for sockets between primary and secondary.
In HAST we use two sockets - one for only sending the data and one for only
receiving the data.

MFC after:	1 month
2011-04-02 09:25:13 +00:00
pjd
02471b97ed Handle the problem described in r220264 by using GEOM GATE queue of unlimited
length. This should fix deadlocks reported by HAST users.

MFC after:	1 week
2011-04-02 07:01:09 +00:00
pjd
7c9a800121 Use timeout from configuration file not only when sending and receiving,
but also when establishing connection.

MFC after:	1 week
2011-03-25 20:15:16 +00:00
pjd
f6f9894d9f Use role2str() when setting process title.
MFC after:	1 week
2011-03-25 20:13:38 +00:00
trociny
9a2a9da631 After synchronization is complete we should make primary counters be
equal to secondary counters:

  primary_localcnt = secondary_remotecnt
  primary_remotecnt = secondary_localcnt

Previously it was done wrong and split-brain was observed after
primary had synchronized up-to-date data from secondary.

Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-22 20:27:26 +00:00
trociny
03d954b2b3 For requests that are sent only to remote component use the
error from remote.
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-22 19:49:27 +00:00
pjd
f29604a547 White space cleanups.
MFC after:	1 week
2011-03-22 10:39:34 +00:00
pjd
b84a0251e3 When dropping privileges prefer capsicum over chroot+setgid+setuid.
We can use capsicum for secondary worker processes and hastctl.
When working as primary we drop privileges using chroot+setgid+setuid
still as we need to send ioctl(2)s to ggate device, for which capsicum
doesn't allow (yet).

X-MFC after:	capsicum is merged to stable/8
2011-03-21 21:31:50 +00:00
pjd
a444cd5681 Initialize localcnt on first write. This fixes assertion when we create
resource, set role to primary, do no writes, then sent it to secondary
and accept connection from primary.

MFC after:	1 week
2011-03-21 21:16:12 +00:00
pjd
3420a73611 In hast.conf we define the other node's address in 'remote' variable.
This way we know how to connect to secondary node when we are primary.
The same variable is used by the secondary node - it only accepts
connections from the address stored in 'remote' variable.
In cluster configurations it is common that each node has its individual
IP address and there is one addtional shared IP address which is assigned
to primary node. It seems it is possible that if the shared IP address is
from the same network as the individual IP address it might be choosen by
the kernel as a source address for connection with the secondary node.
Such connection will be rejected by secondary, as it doesn't come from
primary node individual IP.

Add 'source' variable that allows to specify source IP address we want to
bind to before connecting to the secondary node.

MFC after:	1 week
2011-03-21 08:54:59 +00:00
trociny
66e5107b57 For secondary, set 2 * HAST_KEEPALIVE seconds timeout for incoming
connection so the worker will exit if it does not receive packets from
the primary during this interval.

Reported by:	Christian Vogt <Christian.Vogt@haw-hamburg.de>
Tested by:	Christian Vogt <Christian.Vogt@haw-hamburg.de>
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-17 21:02:14 +00:00
trociny
a3ae0953aa Make workers inherit debug level from the main process.
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-11 12:12:35 +00:00
pjd
62e5e79029 - Log size of data to synchronize in human readable form (using %N).
- Log synchronization time (using %T).
- Log synchronization speed in human readable form (using %N).

MFC after:	2 weeks
2011-03-07 10:41:12 +00:00
pjd
337b50efa8 Allow to compress on-the-wire data using two algorithms:
- HOLE - it simply turns all-zero blocks into few bytes header;
	it is extremely fast, so it is turned on by default;
	it is mostly intended to speed up initial synchronization
	where we expect many zeros;
- LZF - very fast algorithm by Marc Alexander Lehmann, which shows
	very decent compression ratio and has BSD license.

MFC after:	2 weeks
2011-03-06 23:09:33 +00:00
pjd
f56b79fee1 Allow to checksum on-the-wire data using either CRC32 or SHA256.
MFC after:	2 weeks
2011-03-06 22:56:14 +00:00
pjd
d2daebca5a Setup another socketpair between parent and child, so that primary sandboxed
worker can ask the main privileged process to connect in worker's behalf
and then we can migrate descriptor using this socketpair to worker.
This is not really needed now, but will be needed once we start to use
capsicum for sandboxing.

MFC after:	1 week
2011-02-03 11:39:49 +00:00
pjd
0c525303bd Add missing locking after moving keepalive_send() to remote send thread
in r214692.

MFC after:	1 week
2011-02-03 11:33:32 +00:00
pjd
c7493a8a85 Let the caller log info about successful privilege drop.
We don't want to log this in hastctl.

MFC after:	1 week
2011-02-03 10:37:44 +00:00
pjd
3acb629cd2 Allow to specify connection timeout by the caller.
MFC after:	1 week
2011-02-02 15:42:00 +00:00
pjd
d916d2edb5 - Use pjdlog for assertions and aborts as this will log assert/abort message
to syslog if we run in background.
- Asserts in proto.c that method we want to call is implemented and remove
  dummy methods from protocols implementation that are only there to abort
  the program with nice message.

MFC after:	1 week
2011-01-31 18:32:17 +00:00
pjd
621f7543a9 Drop privileges in worker processes.
Accepting connections and handshaking in secondary is still done before
dropping privileges. It should be implemented by only accepting connections in
privileged main process and passing connection descriptors to the worker, but
is not implemented yet.

MFC after:	1 week
2011-01-28 22:35:46 +00:00
pjd
1c97582ecb Use newly added descriptors_assert() function to ensure only expected
descriptors are open.

MFC after:	1 week
2011-01-28 21:57:42 +00:00
pjd
16ad1c7c69 Close all unneeded descriptors after fork(2).
MFC after:	1 week
2011-01-28 21:52:37 +00:00
pjd
1c569578be Add comments to places where we treat errors as ciritical, but it is possible
to handle them more gracefully.

MFC after:	1 week
2011-01-28 21:51:40 +00:00
pjd
b9551ad06f Don't open configuration file from worker process. Handle SIGHUP in the
master process only and pass changes to the worker processes over control
socket. This removes access to global namespace in preparation for capsicum
sandboxing.

MFC after:	2 weeks
2011-01-24 15:04:15 +00:00
pjd
595eb1de94 The 'ret' variable is of type ssize_t and we use proper format for it (%zd), so
no (bogus) cast is needed.

MFC after:	3 days
2010-12-16 19:48:03 +00:00
pjd
04dabf8bdf Improve problems logging.
MFC after:	3 days
2010-12-16 07:30:47 +00:00
pjd
aa49a373c5 Don't ignore errors from remote requests.
MFC after:	3 days
2010-12-16 07:29:58 +00:00
pjd
ec5f555493 Move timeout.tv_sec initialization outside the loop - sigtimedwait(2) won't
modify it.

Submitted by:	Mikolaj Golub <to.my.trociny@gmail.com>
MFC after:	3 days
2010-11-15 03:07:42 +00:00
pjd
781dd0dfd8 1. Exit when we cannot create incoming connection.
2. Improve logging to inform which connection can't be created.

Submitted by:	[1] Mikolaj Golub <to.my.trociny@gmail.com>
MFC after:	3 days
2010-11-15 03:05:33 +00:00
pjd
c148a74821 Send packets to remote node only via the send thread to avoid possible
races - in this case a keepalive packet was send from wrong thread which
lead to connection dropping, because of corrupted packet.

Fix it by sending keepalive packets directly from the send thread.
As a bonus we now send keepalive packets only when connection is idle.

Submitted by:	Mikolaj Golub <to.my.trociny@gmail.com>
MFC after:	3 days
2010-11-02 22:13:08 +00:00