Commit Graph

89 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
dfb1aece41 fork(2) returns -1 on failure, not some random negative number.
MFC after:	3 days
2012-01-06 23:44:26 +00:00
Pawel Jakub Dawidek
07ebc3626e Implement 'async' mode for HAST.
MFC after:	3 days
2011-10-27 20:32:57 +00:00
Pawel Jakub Dawidek
3f5bce1822 Minor cleanups.
MFC after:	3 days
2011-10-27 20:15:37 +00:00
Pawel Jakub Dawidek
43b8675beb Reduce indentation.
MFC after:	3 days
2011-10-27 20:13:39 +00:00
Pawel Jakub Dawidek
5a58d22a84 Improve comment so it doesn't suggest race is possible, but that we handle
the race.

MFC after:	3 days
2011-10-27 20:10:21 +00:00
Pawel Jakub Dawidek
1212a85c4a Monor cleanups.
MFC after:	3 days
2011-10-27 18:49:16 +00:00
Pawel Jakub Dawidek
8a34134ac2 Delay resuid generation until first connection to secondary, not until first
write. This way on first connection we will synchronize only the extents that
were modified during the lifetime of primary node, not entire GEOM provider.

MFC after:	3 days
2011-10-27 18:45:01 +00:00
Pawel Jakub Dawidek
e3feec94eb Correct typo.
MFC after:	3 days
2011-09-28 13:25:27 +00:00
Pawel Jakub Dawidek
12daf727f6 If the underlying provider doesn't support BIO_FLUSH, log it only once
and don't bother trying in the future.

MFC after:	3 days
2011-09-28 13:19:47 +00:00
Pawel Jakub Dawidek
518dd4c0d9 After every activemap change flush disk's write cache, so that write
reordering won't make the actual write to be committed before marking
the coresponding extent as dirty.

It can be disabled in configuration file.

If BIO_FLUSH is not supported by the underlying file system we log a warning
and never send BIO_FLUSH again to that GEOM provider.

MFC after:	3 days
2011-09-28 13:08:51 +00:00
Pawel Jakub Dawidek
be1143efb9 No need to wrap pjdlog functions around with KEEP_ERRNO() macro.
MFC after:	3 days
2011-09-27 08:26:09 +00:00
Pawel Jakub Dawidek
09c2e8431a Correct two mistakes when converting asserts to PJDLOG_ASSERT()/PJDLOG_ABORT().
MFC after:	3 days
2011-09-27 07:59:10 +00:00
Mikolaj Golub
adca96f861 Fix indentation.
Approved by:	pjd (mentor)
2011-07-13 05:32:55 +00:00
Mikolaj Golub
d9f039e0b3 Check the returned value of activemap_write_complete() and update matadata on
disk if needed. This should fix a potential case when extents are cleared in
activemap but metadata is not updated on disk.

Suggested by:	pjd
Approved by:	pjd (mentor)
2011-06-28 21:01:32 +00:00
Mikolaj Golub
ba2a822490 In HAST we use two sockets - one for only sending the data and one for
only receiving the data. In r220271 the unused directions were
disabled using shutdown(2).

Unfortunately, this broke automatic receive buffer sizing, which
currently works only for connections in ETASBLISHED state. It was a
root cause of the issue reported by users, when connection between
primary and secondary could get stuck.

Disable the code introduced in r220271 until the issue with automatic
buffer sizing is not resolved.

Reported by:	Daniel Kalchev <daniel@digsys.bg>, danger, sobomax
Tested by:	Daniel Kalchev <daniel@digsys.bg>, danger
Approved by:	pjd (mentor)
MFC after:	1 week
2011-06-17 07:07:26 +00:00
Mikolaj Golub
a01a750f32 If READ from the local node failed we send the request to the remote
node. There is no use in doing this for synchronization requests.

Approved by:	pjd (mentor)
MFC after:	1 week
2011-05-29 21:20:47 +00:00
Pawel Jakub Dawidek
3db86c39ae Keep statistics on number of BIO_READ, BIO_WRITE, BIO_DELETE and BIO_FLUSH
requests as well as number of activemap updates.

Number of BIO_WRITEs and activemap updates are especially interesting, because
if those two are too close to each other, it means that your workload needs
bigger number of dirty extents. Activemap should be updated as rarely as
possible.

MFC after:	1 week
2011-05-23 21:15:19 +00:00
Pawel Jakub Dawidek
0cddb12ffd Currently we are unable to use capsicum for the primary worker process,
because we need to do ioctl(2)s, which are not permitted in the capability
mode. What we do now is to chroot(2) to /var/empty, which restricts access
to file system name space and we drop privileges to hast user and hast
group.

This still allows to access to other name spaces, like list of processes,
network and sysvipc.

To address that, use jail(2) instead of chroot(2). Using jail(2) will restrict
access to process table, network (we use ip-less jails) and sysvipc (if
security.jail.sysvipc_allowed is turned off). This provides much better
separation.

MFC after:	1 week
2011-05-14 17:02:03 +00:00
Pawel Jakub Dawidek
ac0401e321 When we become primary, we connect to the remote and expect it to be in
secondary role. It is possible that the remote node is primary, but only
because there was a role change and it didn't finish cleaning up (unmounting
file systems, etc.). If we detect such situation, wait for the remote node
to switch the role to secondary before accepting I/Os. If we don't wait for
it in that case, we will most likely cause split-brain.

MFC after:	1 week
2011-04-20 18:43:28 +00:00
Pawel Jakub Dawidek
06cbf54941 Scenario:
- We have two nodes connected and synchronized (local counters on both sides
  are 0).
- We take secondary down and recreate it.
- Primary connects to it and starts synchronization (but local counters are
  still 0).
- We switch the roles.
- Synchronization restarts but data is synchronized now from new primary
  (because local counters are 0) that doesn't have new data yet.

This fix this issue we bump local counter on primary when we discover that
connected secondary was recreated and has no data yet.

Reported by:	trociny
Discussed with:	trociny
Tested by:	trociny
MFC after:	1 week
2011-04-19 19:26:27 +00:00
Pawel Jakub Dawidek
02dfe9724c Declare directions for sockets between primary and secondary.
In HAST we use two sockets - one for only sending the data and one for only
receiving the data.

MFC after:	1 month
2011-04-02 09:25:13 +00:00
Pawel Jakub Dawidek
2a49afacd1 Handle the problem described in r220264 by using GEOM GATE queue of unlimited
length. This should fix deadlocks reported by HAST users.

MFC after:	1 week
2011-04-02 07:01:09 +00:00
Pawel Jakub Dawidek
7d4df5cd0b Use timeout from configuration file not only when sending and receiving,
but also when establishing connection.

MFC after:	1 week
2011-03-25 20:15:16 +00:00
Pawel Jakub Dawidek
643080b75f Use role2str() when setting process title.
MFC after:	1 week
2011-03-25 20:13:38 +00:00
Mikolaj Golub
9237aa3fa5 After synchronization is complete we should make primary counters be
equal to secondary counters:

  primary_localcnt = secondary_remotecnt
  primary_remotecnt = secondary_localcnt

Previously it was done wrong and split-brain was observed after
primary had synchronized up-to-date data from secondary.

Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-22 20:27:26 +00:00
Mikolaj Golub
b068d5aafb For requests that are sent only to remote component use the
error from remote.
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-22 19:49:27 +00:00
Pawel Jakub Dawidek
cd72d521e3 White space cleanups.
MFC after:	1 week
2011-03-22 10:39:34 +00:00
Pawel Jakub Dawidek
4d8dc3b838 When dropping privileges prefer capsicum over chroot+setgid+setuid.
We can use capsicum for secondary worker processes and hastctl.
When working as primary we drop privileges using chroot+setgid+setuid
still as we need to send ioctl(2)s to ggate device, for which capsicum
doesn't allow (yet).

X-MFC after:	capsicum is merged to stable/8
2011-03-21 21:31:50 +00:00
Pawel Jakub Dawidek
9446b4536e Initialize localcnt on first write. This fixes assertion when we create
resource, set role to primary, do no writes, then sent it to secondary
and accept connection from primary.

MFC after:	1 week
2011-03-21 21:16:12 +00:00
Pawel Jakub Dawidek
0b626a289e In hast.conf we define the other node's address in 'remote' variable.
This way we know how to connect to secondary node when we are primary.
The same variable is used by the secondary node - it only accepts
connections from the address stored in 'remote' variable.
In cluster configurations it is common that each node has its individual
IP address and there is one addtional shared IP address which is assigned
to primary node. It seems it is possible that if the shared IP address is
from the same network as the individual IP address it might be choosen by
the kernel as a source address for connection with the secondary node.
Such connection will be rejected by secondary, as it doesn't come from
primary node individual IP.

Add 'source' variable that allows to specify source IP address we want to
bind to before connecting to the secondary node.

MFC after:	1 week
2011-03-21 08:54:59 +00:00
Mikolaj Golub
8d7dcf14ff For secondary, set 2 * HAST_KEEPALIVE seconds timeout for incoming
connection so the worker will exit if it does not receive packets from
the primary during this interval.

Reported by:	Christian Vogt <Christian.Vogt@haw-hamburg.de>
Tested by:	Christian Vogt <Christian.Vogt@haw-hamburg.de>
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-17 21:02:14 +00:00
Mikolaj Golub
bc7a916a25 Make workers inherit debug level from the main process.
Approved by:	pjd (mentor)
MFC after:	1 week
2011-03-11 12:12:35 +00:00
Pawel Jakub Dawidek
fa356f6cfe - Log size of data to synchronize in human readable form (using %N).
- Log synchronization time (using %T).
- Log synchronization speed in human readable form (using %N).

MFC after:	2 weeks
2011-03-07 10:41:12 +00:00
Pawel Jakub Dawidek
8cd3d45ad9 Allow to compress on-the-wire data using two algorithms:
- HOLE - it simply turns all-zero blocks into few bytes header;
	it is extremely fast, so it is turned on by default;
	it is mostly intended to speed up initial synchronization
	where we expect many zeros;
- LZF - very fast algorithm by Marc Alexander Lehmann, which shows
	very decent compression ratio and has BSD license.

MFC after:	2 weeks
2011-03-06 23:09:33 +00:00
Pawel Jakub Dawidek
1fee97b01f Allow to checksum on-the-wire data using either CRC32 or SHA256.
MFC after:	2 weeks
2011-03-06 22:56:14 +00:00
Pawel Jakub Dawidek
32ecf62028 Setup another socketpair between parent and child, so that primary sandboxed
worker can ask the main privileged process to connect in worker's behalf
and then we can migrate descriptor using this socketpair to worker.
This is not really needed now, but will be needed once we start to use
capsicum for sandboxing.

MFC after:	1 week
2011-02-03 11:39:49 +00:00
Pawel Jakub Dawidek
21e7bc5e52 Add missing locking after moving keepalive_send() to remote send thread
in r214692.

MFC after:	1 week
2011-02-03 11:33:32 +00:00
Pawel Jakub Dawidek
f4c96f944c Let the caller log info about successful privilege drop.
We don't want to log this in hastctl.

MFC after:	1 week
2011-02-03 10:37:44 +00:00
Pawel Jakub Dawidek
9d70b24b93 Allow to specify connection timeout by the caller.
MFC after:	1 week
2011-02-02 15:42:00 +00:00
Pawel Jakub Dawidek
2ec483c58e - Use pjdlog for assertions and aborts as this will log assert/abort message
to syslog if we run in background.
- Asserts in proto.c that method we want to call is implemented and remove
  dummy methods from protocols implementation that are only there to abort
  the program with nice message.

MFC after:	1 week
2011-01-31 18:32:17 +00:00
Pawel Jakub Dawidek
6d7967de8a Drop privileges in worker processes.
Accepting connections and handshaking in secondary is still done before
dropping privileges. It should be implemented by only accepting connections in
privileged main process and passing connection descriptors to the worker, but
is not implemented yet.

MFC after:	1 week
2011-01-28 22:35:46 +00:00
Pawel Jakub Dawidek
f463896e5e Use newly added descriptors_assert() function to ensure only expected
descriptors are open.

MFC after:	1 week
2011-01-28 21:57:42 +00:00
Pawel Jakub Dawidek
da1783ea29 Close all unneeded descriptors after fork(2).
MFC after:	1 week
2011-01-28 21:52:37 +00:00
Pawel Jakub Dawidek
d64c0992e4 Add comments to places where we treat errors as ciritical, but it is possible
to handle them more gracefully.

MFC after:	1 week
2011-01-28 21:51:40 +00:00
Pawel Jakub Dawidek
115f4e5c3e Don't open configuration file from worker process. Handle SIGHUP in the
master process only and pass changes to the worker processes over control
socket. This removes access to global namespace in preparation for capsicum
sandboxing.

MFC after:	2 weeks
2011-01-24 15:04:15 +00:00
Pawel Jakub Dawidek
fba1bf5a2c The 'ret' variable is of type ssize_t and we use proper format for it (%zd), so
no (bogus) cast is needed.

MFC after:	3 days
2010-12-16 19:48:03 +00:00
Pawel Jakub Dawidek
cd7b7ee577 Improve problems logging.
MFC after:	3 days
2010-12-16 07:30:47 +00:00
Pawel Jakub Dawidek
7208920499 Don't ignore errors from remote requests.
MFC after:	3 days
2010-12-16 07:29:58 +00:00
Pawel Jakub Dawidek
d448536ceb Move timeout.tv_sec initialization outside the loop - sigtimedwait(2) won't
modify it.

Submitted by:	Mikolaj Golub <to.my.trociny@gmail.com>
MFC after:	3 days
2010-11-15 03:07:42 +00:00
Pawel Jakub Dawidek
1dd5a4bfa2 1. Exit when we cannot create incoming connection.
2. Improve logging to inform which connection can't be created.

Submitted by:	[1] Mikolaj Golub <to.my.trociny@gmail.com>
MFC after:	3 days
2010-11-15 03:05:33 +00:00