Commit Graph

403 Commits

Author SHA1 Message Date
Konstantin Belousov
44f3b9c787 Print more useful information about the transfer that trigger the assertion.
Other data is available with ddb command 'show pginfo'.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-10-21 16:17:46 +00:00
Gleb Smirnoff
255c1caae3 - Create kern.ipc.sendfile namespace, and put the new "readhead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
  to new namespace in favor of POLA.

Reviewed by:	scottl
Approved by:	re (gjb)
2013-09-22 13:36:52 +00:00
Gleb Smirnoff
85fdd534cf Fix assertion in sendfile_readpage() to assert only the validity
of requested amount of data in a page. Move assertion down below
object unlock.

Approved by:	re (kib)
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-09-17 06:37:21 +00:00
Konstantin Belousov
64c5de5483 Fix build with gcc.
Build-tested by:	gjb
Approved by:	re (glebius)
2013-09-11 17:31:22 +00:00
Konstantin Belousov
227aaa86ed Implement sendfile(2) for the posix shared memory segment file descriptor,
in addition to the regular files.

Requested by:	alc
Discussed with:	emaste
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-11 06:41:15 +00:00
Dag-Erling Smørgrav
1a05c762b9 Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
Pawel Jakub Dawidek
547561f1b0 Style fixes. Most fixes are about not treating integers and pointers as
booleans.
2013-09-05 00:17:38 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Andre Oppermann
bb25e5ab00 Give (*ext_free) an int return value allowing for very sophisticated
external mbuf buffer management capabilities in the future.

For now only EXT_FREE_OK is defined with current legacy behavior.

Sponsored by:	The FreeBSD Foundation
2013-08-25 10:57:09 +00:00
Andre Oppermann
9a73687609 Add an mbuf pointer parameter to (*ext_free) to give the external
free function access to the mbuf the external memory was attached
to.

Mechanically adjust all users to include the mbuf parameter.

This fixes a long standing annoyance for external free functions.
Before one had to sacrifice one of the argument pointers for this.

Sponsored by:	The FreeBSD Foundation
2013-08-24 16:57:44 +00:00
Konstantin Belousov
f6d76b0ec1 Since the 253927, which removed the soft busy call for the sf page, it
does not make sense to wait for the soft busy state of the page to
drain.  The vm object lock is dropped immediately after, so the result
of the wait is invalidated.

It might make sense to not wait for the hard busy state as well,
esp. for the fully valid page, but this is postponed for now.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-23 14:50:03 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Xin LI
42d875a536 Fix build. 2013-08-17 00:25:11 +00:00
Gleb Smirnoff
ca04d21d5f Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
Gleb Smirnoff
90c35c1939 - Minor style(9) fix.
- Bring a comment up to date.
2013-08-13 13:40:31 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Attilio Rao
878a788734 Remove unnecessary soft busy of the page before to do vn_rdwr() in
kern_sendfile() which is unnecessary.
The page is already wired so it will not be subjected to pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc, jeff
Tested by:	pho
2013-08-04 15:56:19 +00:00
Scott Long
fcd9ff2c67 Another fix for r253823; retain the default of 1 readahead block for sendfile.
Submitted by:	glebius
Obtained from:	Netflix
MFC after:	3 days
2013-07-31 15:55:01 +00:00
Scott Long
fc4a5f052b Create a knob, kern.ipc.sfreadahead, that allows one to tune the amount of
readahead that sendfile() will do.  Default remains the same.

Obtained from:	Netflix
MFC after:	3 days
2013-07-30 23:26:05 +00:00
Andrey V. Elsukov
05d1f5bce0 Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with:	glebius
2013-07-15 06:16:57 +00:00
Konstantin Belousov
3328431dec Item 1 in r248830 causes earlier exits from the sendfile(2), before
all requested data was sent.  The reason is that xfsize <= 0 condition
must not be tested at all if space == loopbytes.  Otherwise, the done
is set to 1, and sendfile(2) is aborted too early.

Instead of moving the condition to exiting the inner loop after the
xfersize check, directly check for the completed transfer before the
testing of the available space in the socket buffer, and revert item 1
of r248830.  It is arguably another bug to sleep waiting for socket
buffer space (or return EAGAIN for non-blocking socket) if all bytes
are already transferred.

Reported by:	pho
Discussed with:	scottl, gibbs
Tested by:	scottl (stable/9 backport), pho
2013-05-09 16:05:51 +00:00
Jilles Tjoelker
da7d2afb6d Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
Konstantin Belousov
3d31767952 Eliminate the layering violation in the kern_sendfile(). When quering
the file size, use VOP_GETATTR() instead of accessing vnode vm_object
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as consequence, the whole internal sendfile loop.  Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by:	alc
Tested by:	pho, bf (previous version)
MFC after:	1 week
2013-04-28 19:12:09 +00:00
Gleb Smirnoff
14658a80fe Don't compare unsigned socklen_t against < 0.
Reviewed by:	jhb
2013-04-19 13:40:13 +00:00
Scott Long
07dbf2c768 Several fixes and improvements to sendfile()
1.  If we wanted to send exactly as many bytes as the socket buffer is
    sized for, the inner loop of kern_sendfile() would see that the
    socket is full before seeing that it had no more bytes left to send.
    This would cause it to return EAGAIN to the caller instead of
    success.  Fix by changing the order that these conditions are tested.
2.  Simplify the calculation for the bytes to send in each iteration of
    the inner loop of kern_sendfile()
3.  Fix some calls with bogus arguments to sf_buf_ext().  These would
    only trigger on mbuf allocation failure, but would be hilariously
    bad if they did trigger.

Submitted by:	gibbs(3), andre(2)
Reviewed by:	emax, andre
Obtained from:	Netflix
MFC after:	1 week
2013-03-28 14:14:28 +00:00
Jilles Tjoelker
c2e3c52e0d Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by:	kib
2013-03-19 20:58:17 +00:00
Gleb Smirnoff
93cfe76349 - Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by:	Nginx, Inc.
2013-03-15 10:17:24 +00:00
Gleb Smirnoff
3b4a84e757 In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.

Sponsored by:	Nginx, Inc.
2013-03-12 12:15:24 +00:00
Michael Tuexen
fbb3471022 Return an error if sctp_peeloff() fails because a socket can't be allocated.
MFC after: 3 days
2013-03-11 17:43:55 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek
7493f24ee6 - Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow to bind and connect respectively to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
  in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Pawel Jakub Dawidek
fbda3d5dae Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and
recvfrom(2) syscalls.

Sponsored by:	The FreeBSD Foundation
2013-02-07 00:36:00 +00:00
Pawel Jakub Dawidek
82b316b377 Minor style tweaks. 2013-02-07 00:27:11 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Konstantin Belousov
1c771f9222 After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
Pawel Jakub Dawidek
f3cd980557 Style fixes and simplifications.
MFC after:	1 month
2012-06-11 16:08:03 +00:00
Mateusz Guzik
3b5da8d609 Plug socket refcount leak on error in sys_sctp_peeloff.
Reviewed by:	tuexen
Approved by:	trasz (mentor)
MFC after:	3 days
2012-06-08 08:04:51 +00:00
Gleb Smirnoff
36eeafa0e5 style(9) for r236563. 2012-06-05 05:16:04 +00:00
Gleb Smirnoff
8955d2720f Microoptimisation of code from r236560, also coming from Nginx Inc.
Submitted by:	ru
2012-06-04 14:18:13 +00:00
Gleb Smirnoff
835d890042 Optimise kern_sendfile(): skip cycling through the entire mbuf chain in
m_cat(), storing pointer to last mbuf in chain in local variable and
attaching new mbuf to the end of chain.

Submitter reports that CPU load dropped for > 10% on a web server
serving large files with this optimisation.

Submitted by:	Sergey Budnevitch <sb nginx.com>
2012-06-04 12:49:21 +00:00
Michael Tuexen
99f293a20e Fix bugs which can result in a panic when an non-SCTP socket it
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.
2012-03-15 14:13:38 +00:00
Konstantin Belousov
526d0bd547 Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
Jonathan Anderson
12bc222e57 Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)
2011-06-30 10:56:02 +00:00
John Baldwin
c721b93449 Log the socket address passed as the destination to sendto() and sendmsg()
via ktrace.

MFC after:	1 week
2011-06-07 17:40:33 +00:00
Konstantin Belousov
1fe80828e7 After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by:	jhb
X-MFC-note:	do not
2011-04-01 13:28:34 +00:00
Bjoern A. Zeeb
1fb51a12f2 Mfp4 CH=177274,177280,177284-177285,177297,177324-177325
VNET socket push back:
  try to minimize the number of places where we have to switch vnets
  and narrow down the time we stay switched.  Add assertions to the
  socket code to catch possibly unset vnets as seen in r204147.

  While this reduces the number of vnet recursion in some places like
  NFS, POSIX local sockets and some netgraph, .. recursions are
  impossible to fix.

  The current expectations are documented at the beginning of
  uipc_socket.c along with the other information there.

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:  jhb
  Tested by:    zec

Tested by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	2 weeks
2011-02-16 21:29:13 +00:00
Alan Cox
8189ac85e9 Eliminate unnecessary page hold_count checks. These checks predate
r90944, which introduced a general mechanism for handling the freeing
of held pages.

Reviewed by:	kib@
2011-02-03 14:42:46 +00:00
Konstantin Belousov
9ca9fc5380 If more than one thread allocated sf buffers for sendfile(2), and
each of the threads needs more while current pool of the buffers is
exhausted, then neither thread can make progress.

Switch to nowait allocations after we got first buffer already.

Reported by:	az
Reviewed by:	alc (previous version)
Tested by:	pho
MFC after:	1 week
2011-01-28 17:37:09 +00:00
Pawel Jakub Dawidek
b452cf6317 Just pass M_ZERO to malloc(9) instead of clearing allocated memory separately. 2010-12-14 06:19:13 +00:00
Michael Tuexen
049640c1f0 Implement correct handling of address parameter and
sendinfo for SCTP send calls.

MFC after: 4 weeks.
2010-09-05 20:13:07 +00:00
John Baldwin
7a6f3d7890 Send SIGPIPE to the thread that issued the offending system call
rather than to the entire process.

Reported by:	Anit Chakraborty
Reviewed by:	kib, deischen (concept)
MFC after:	1 week
2010-06-29 20:44:19 +00:00
Michael Tuexen
e1c97831ec * Do not dereference a NULL pointer when calling an SCTP send syscall
not providing a destination address and using ktrace.
* Do not copy out kernel memory when providing sinfo for sctp_recvmsg().
Both bug where reported by Valentin Nechayev.
The first bug results in a kernel panic.
MFC after: 3 days.
2010-06-26 19:26:20 +00:00
Ed Schouten
60ae52f785 Use ISO C99 integer types in sys/kern where possible.
There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.
2010-06-21 09:55:56 +00:00
Alan Cox
f0c0d3998d Remove page queues locking from all sf_buf_mext()-like functions. The page
lock now suffices.

Fix a couple nearby style violations.
2010-05-06 17:43:41 +00:00
Alan Cox
52683078a2 Eliminate a small bit of unneeded code from kern_sendfile(): While
kern_sendfile() is running, the file's vm object can't be destroyed
because kern_sendfile() increments the vm object's reference count.  (Once
kern_sendfile() decrements the reference count and returns, the vm object
can, however, be destroyed.  So, sf_buf_mext() must handle the case where
the vm object is destroyed.)

Reviewed by:	kib
2010-05-06 15:52:08 +00:00
Alan Cox
913814935a This is the first step in transitioning responsibility for synchronizing
access to the page's wire_count from the page queues lock to the page lock.

Submitted by:	kmacy
2010-05-03 05:41:50 +00:00
Konstantin Belousov
a0b8e597e5 Lock the page around hold_count access.
Reviewed by:	alc
2010-05-02 19:25:22 +00:00
Konstantin Belousov
5322f02ec0 Properly handle compat32 calls to sctp generic sendmsd/recvmsg functions that
take iov.

Reviewed by:	tuexen
MFC after:	2 weeks
2010-03-19 10:46:54 +00:00
Konstantin Belousov
fd9d1e7627 Remove dead statement.
Reviewed by:	tuexen
MFC after:	2 weeks
2010-03-19 10:44:02 +00:00
Konstantin Belousov
0a977ede48 Fix two style issues.
MFC after:	2 weeks
2010-03-19 10:41:32 +00:00
Pawel Jakub Dawidek
0454fe84e4 Use NULL instead of 0 when setting up pointer. 2010-02-18 22:12:40 +00:00
Matt Jacob
e7d829a46c Fix argument order in a call to mtx_init.
MFC after:	1 week
2009-12-17 00:22:56 +00:00
Konstantin Belousov
1c89fc757a If socket buffer space appears to be lower then sum of count of already
prepared bytes and next portion of transfer, inner loop of kern_sendfile()
aborts, not preparing next mbuf for socket buffer, and not modifying
any outer loop invariants. The thread loops in the outer loop forever.

Instead of breaking from inner loop, prepare only bytes that fit into
the socket buffer space.

In collaboration with:	pho
Reviewed by:	bz
PR:	kern/138999
MFC after:	2 weeks
2009-11-03 12:52:35 +00:00
Konstantin Belousov
7415a41f4a Fix style issue. 2009-10-29 10:03:08 +00:00
Konstantin Belousov
75ffdc4049 Do not dereference vp->v_mount without holding vnode lock and checking
that the vnode is not reclaimed.

Noted by:	Igor Sysoev <is rambler-co ru>
MFC after:	1 week
2009-10-01 12:50:26 +00:00
Michael Tuexen
8518270e20 Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:02:16 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Robert Watson
15ca46f69d Audit file descriptor numbers for various socket-related system calls.
Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 19:55:11 +00:00
Robert Watson
9e4c1521d5 Define missing audit argument macro AUDIT_ARG_SOCKET(), and
capture the domain, type, and protocol arguments to socket(2)
and socketpair(2).

Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 18:54:49 +00:00
Bjoern A. Zeeb
c03528b663 SCTP needs either IPv4 or IPv6 as lower layer[1].
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.

Discussed with:	tuexen [1]
2009-06-10 14:36:59 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Dmitry Chagin
4202e1be20 Split native socketpair() syscall onto kern_socketpair() which should
be used by kernel consumers and socketpair() itself.

Approved by:	kib (mentor)
MFC after:	1 month
2009-05-31 12:12:38 +00:00
Jeff Roberson
bf422e5f27 - Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
 - Save old file descriptor tables created on expansion until
   the entire descriptor table is freed so that pointers may be
   followed without regard for expanders.
 - Mark the file zone as NOFREE so we may attempt to reference
   potentially freed files.
 - Convert several fget_locked() users to fget_unlocked().  This
   requires us to manage reference counts explicitly but reduces
   locking overhead in the common case.
2009-05-14 03:24:22 +00:00
Marko Zec
2114e063f0 A NOP change: style / whitespace cleanup of the noise that slipped
into r191816.

Spotted by:	bz
Approved by:	julian (mentor) (an earlier version of the diff)
2009-05-08 14:34:25 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Kip Macy
f0b9868d3a sendfile doesn't modify the vnode - acquire vnode lock shared
Reviewed by:	ups, jeffr
2009-04-12 05:19:35 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Robert Watson
17c2fc0cc7 When sendto(2) is called with an explicit destination address
argument, call mac_socket_check_connect() on that address before
proceeding with the send.  Otherwise policies instrumenting the
connect entry point for the purposes of checking destination
addresses will not have the opportunity to check implicit
connect requests.

MFC after:	3 weeks
Sponsored by:	nCircle Network Security, Inc.
2008-05-22 07:18:54 +00:00
Robert Watson
ae11a989e6 When writing trailers in sendfile(2), don't call kern_writev()
while holding the socket buffer lock.  These leads to an
immediate panic due to recursing the socket buffer lock.  This
bug was introduced in uipc_syscalls.c:1.240, but masked by
another bug until that was fixed in uipc_syscalls.c:1.269.

Note that the current fix isn't perfect, but better than
panicking: normally we guarantee that simultaneous invocations
of a system call to write on a stream socket won't be
interlaced, which is ensured by use of the socket buffer sleep
lock.  This is guaranteed for the sendfile headers, but not
trailers.  In practice, this is likely not a problem, but
should be fixed.

MFC after:	3 days
Pointy hat to:	andre (1.240), cperciva (1.269)
2008-04-27 15:50:00 +00:00
Ruslan Ermilov
ea26d58729 Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by:	arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
2008-03-25 09:39:02 +00:00
Colin Percival
491869163b After finishing sending file data in sendfile(2), don't forget to send
the provided trailers.  This has been broken since revision 1.240.

Submitted by:	Dan Nelson
PR:		kern/120948
"sounds ok to me" from:	phk
MFC after:	3 days
2008-02-24 00:07:00 +00:00
Dag-Erling Smørgrav
60e15db992 This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload
consists of the null-terminated name and the contents of any structure
you wish to record.  A new ktrstruct() function constructs and emits a
KTR_STRUCT record.  It is accompanied by convenience macros for struct
stat and struct sockaddr.

In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding funtions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.

Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.

Derived from patches by Andrew Li <andrew2.li@citi.com>.

PR:		kern/117836
MFC after:	3 weeks
2008-02-23 01:01:49 +00:00
Simon L. B. Nielsen
1b7089994c Fix sendfile(2) write-only file permission bypass.
Security:	FreeBSD-SA-08:03.sendfile
Submitted by:	kib
2008-02-14 11:44:31 +00:00
Poul-Henning Kamp
b75a1171d8 Give sendfile(2) a SF_SYNC flag which makes it wait until all mbufs
referencing the files VM pages are returned from the network stack,
making changes to the file safe.

This flag does not guarantee that the data has been transmitted to the
other end.
2008-02-03 15:54:41 +00:00
Poul-Henning Kamp
cf827063a9 Give MEXTADD() another argument to make both void pointers to the
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.

Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.

Likely break the only non-convetional user of the API, after informing
the relevant committer.

Update the mbuf(9) manual page, which was already out of sync on
this point.

Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.

This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.

This does not affect the memory layout or size of mbufs.

Parental oversight by:	sam and rwatson.

No MFC is anticipated.
2008-02-01 19:36:27 +00:00
Robert Watson
265de5bb62 Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
  might be sleeping (potentially indefinitely) while holding sblock(),
  such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
  delivered to the thread calling sorflush() doesn't cause sblock() to
  fail.  The sblock() is required to ensure that all other socket
  consumer threads have, in fact, left, and do not enter, the socket
  buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag.  When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by:	bz, kmacy
Reported by:	Jos Backus <jos at catnook dot com>
2008-01-31 08:22:24 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Randall Stewart
2afb3e849f - During shutdown pending, when the last sack came in and
the last message on the send stream was "null" but still
  there, a state we allow, we could get hung and not clean
  it up and wait for the shutdown guard timer to clear the
  association without a graceful close. Fix this so that
  that we properly clean up.
- Added support for Multiple ASCONF per new RFC. We only
  (so far) accept input of these and cannot yet generate
  a multi-asconf.
- Sysctl'd support for experimental Fast Handover feature. Always
  disabled unless sysctl or socket option changes to enable.
- Error case in add-ip where the peer supports AUTH and ADD-IP
  but does NOT require AUTH of ASCONF/ASCONF-ACK. We need to
  ABORT in this case.
- According to the Kyoto summit of socket api developers
  (Solaris, Linux, BSD). We need to have:
   o non-eeor mode messages be atomic - Fixed
   o Allow implicit setup of an assoc in 1-2-1 model if
     using the sctp_**() send calls - Fixed
   o Get rid of HAVE_XXX declarations - Done
   o add a sctp_pr_policy in hole in sndrcvinfo structure - Done
   o add a PR_SCTP_POLICY_VALID type flag - yet to-do in a future patch!
- Optimize sctp6 calls to reuse code in sctp_usrreq. Also optimize
  when we close sending out the data and disabling Nagle.
- Change key concatenation order to match the auth RFC
- When sending OOTB shutdown_complete always do csum.
- Don't send PKT-DROP to a PKT-DROP
- For abort chunks just always checksums same for
  shutdown-complete.
- inpcb_free front state had a bug where in queue
  data could wedge an assoc. We need to just abandon
  ones in front states (free_assoc).
- If a peer sends us a 64k abort, we would try to
  assemble a response packet which may be larger than
  64k. This then would be dropped by IP. Instead make
  a "minimum" size for us 64k-2k (we want at least
  2k for our initack). If we receive such an init
  discard it early without all the processing.
- When we peel off we must increment the tcb ref count
  to keep it from being freed from underneath us.
- handling fwd-tsn had bugs that caused memory overwrites
  when given faulty data, fixed so can't happen and we
  also stop at the first bad stream no.
- Fixed so comm-up generates the adaption indication.
- peeloff did not get the hmac params copied.
- fix it so we lock the addr list when doing src-addr selection
  (in future we need to use a multi-reader/one writer lock here)
- During lowlevel output, we could end up with a _l_addr set
  to null if the iterator is calling the output routine. This
  means we would possibly crash when we gather the MTU info.
  Fix so we only do the gather where we have a src address
  cached.
- we need to be sure to set abort flag on conn state when
  we receive an abort.
- peeloff could leak a socket. Moved code so the close will
  find the socket if the peeloff fails (uipc_syscalls.c)

Approved by:	re@freebsd.org(Ken Smith)
2007-08-27 05:19:48 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
Randall Stewart
b8709d23c5 - Add some needed error checking on bad fd passing in the sctp
syscalls.
Approved by:	re@freebsd.org (Ken Smith)
Obtained from:	Weongyo Jeong (weongyo.jeong@gmail.com)
2007-07-02 12:50:53 +00:00
Andre Oppermann
1c9dbd1567 In kern_sendfile() adjust byte accounting of the file sending loop to
ignore the size of any headers that were passed with the sendfile(2)
system call.  Otherwise the file sent will be truncated by the header
size if the nbytes parameter was provided.  The bug doesn't show up
when either nbytes is zero, meaning send the whole file, or no header
iovec is provided.

Resolve a potential error aliasing of errors from the VM and sf_buf
parts and the protocol send parts where an error of the latter over-
writes one of the former.

Update comments.

The byte accounting bug wasn't seen in earlier because none of the popular
sendfile(2) consumers, Apache, lighttpd and our ftpd(8) use it in modes
that trigger it.  The varnish HTTP proxy makes full use of it and exposed
the problem.

Bug found by:	phk
Tested by:	phk
2007-05-19 20:50:59 +00:00
Robert Watson
d19e16a72c Generally migrate to ANSI function headers, and remove 'register' use. 2007-05-16 20:41:08 +00:00