Commit Graph

34 Commits

Author SHA1 Message Date
Ævar Arnfjörð Bjarmason 9dc523aa0e Makefile + hash.h: remove PPC_SHA1 implementation
Remove the PPC_SHA1 implementation added in a6ef3518f9 ([PATCH] PPC
assembly implementation of SHA1, 2005-04-22). When this was added
Apple consumer hardware used the PPC architecture, and the
implementation was intended to improve SHA-1 speed there.

Since it was added we've moved to using sha1collisiondetection by
default, and anyone wanting hard-rolled non-DC SHA-1 implementation
can use OpenSSL's via the OPENSSL_SHA1 knob.

The PPC_SHA1 originally originally targeted 32 bit PPC, and later the
64 bit PPC 970 (a.k.a. Apple PowerPC G5). See 926172c5e4 (block-sha1:
improve code on large-register-set machines, 2009-08-10) for a
reference about the performance on G5 (a comment in block-sha1/sha1.c
being removed here).

I can't get it to do anything but segfault on both the BE and LE POWER
machines in the GCC compile farm[1]. Anyone who's concerned about
performance on PPC these days is likely to be using the IBM POWER
processors.

There have been proposals to entirely remove non-sha1collisiondetection
implementations from the tree[2]. I think per [3] that would be a bit
overzealous. I.e. there are various set-ups git's speed is going to be
more important than the relatively implausible SHA-1 collision attack,
or where such attacks are entirely mitigated by other means (e.g. by
incoming objects being checked with DC_SHA1).

But that really doesn't apply to PPC_SHA1 in particular, which seems
to have outlived its usefulness.

As this gets rid of the only in-tree *.S assembly file we can remove
the small bits of logic from the Makefile needed to build objects
from *.S (as opposed to *.c)

The code being removed here was also throwing warnings with the
"-pedantic" flag, it could have been fixed as 544d93bc3b (block-sha1:
remove use of obsolete x86 assembly, 2022-03-10) did for block-sha1/*,
but as noted above let's remove it instead.

1. https://cfarm.tetaneutral.net/machines/list/
   Tested on gcc{110,112,135,203}, a mixture of POWER [789] ppc64 and
   ppc64le. All segfault in anything needing object
   hashing (e.g. t/t1007-hash-object.sh) when compiled with
   PPC_SHA1=Y.
2. https://lore.kernel.org/git/20200223223758.120941-1-mh@glandium.org/
3. https://lore.kernel.org/git/20200224044732.GK1018190@coredump.intra.peff.net/

Acked-by: brian m. carlson" <sandals@crustytoothpaste.net>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-31 14:37:31 -07:00
brian m. carlson 544d93bc3b block-sha1: remove use of obsolete x86 assembly
In the block SHA-1 code, we have special assembly code for i386 and
amd64 to perform rotations with assembly.  This is supposed to help pick
the correct rotation operation depending on which rotation is smaller,
which can help some systems perform slightly better, since any circular
rotation can be specified as either a rotate left or a rotate right.
However, this isn't needed, so we should remove it.

First, SHA-1, like SHA-2, uses fixed constant rotates.  Thus, all
rotation amounts are known at compile time and are in fact baked into
the code.  Fortunately, peephole optimizers recognize rotations
specified in the normal way and automatically emit the correct code,
including a preference for choosing a rotate left versus a rotate right.
This has been the case for well over a decade, and is a standard example
of the utility of a peephole optimizer.

Moreover, all modern CPUs, with the exception of extremely limited
embedded CPUs such as some Cortex-M processors, provide a barrel
shifter, which lets the CPU perform rotates of any bit amount in
constant time.  This is valuable for many cryptographic algorithms to
improve performance, and is required to prevent timing attacks in
algorithms which use data-dependent rotations (which don't include the
hash algorithms we use).  As a result, even though the compiler does the
correct optimization, it isn't even needed here and either a left or a
right rotate is equally acceptable.

In fact, the SHA-256 code already takes this into account and just
writes the simple code using an inline function to let the compiler
optimize it for us.

The downside of using this code, however, is that it uses a GCC
extension, which makes the compiler complain when using -pedantic unless
it's prefixed with __extension__.  We could fix that, but since it's
not needed, let's just remove it.  We haven't noticed this because
almost everyone uses the SHA1DC code instead, but it still shows up for
some people.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 11:18:05 -08:00
René Scharfe 3d8cbbf2c3 block-sha1: drop trailing semicolon from macro definition
23119ffb4e (block-sha1: put expanded macro parameters in parentheses,
2012-07-22) added a trailing semicolon to the definition of SHA_MIX
without explanation.  It doesn't matter with the current code, but make
sure to avoid potential surprises by removing it again.

This allows the macro to be used almost like a function: Users can
combine it with operators of their choice, but still must not pass an
expression with side-effects as a parameter, as it would be evaluated
multiple times.

Signed-off-by: René Scharfe <l.s.r@web.de>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-17 10:20:01 -07:00
Jeff King 9bb4542b8c block-sha1: take a size_t length parameter
The block-sha1 implementation takes an "unsigned long" for the length of
a buffer to hash, but our hash algorithm wrappers take a size_t, as do
other implementations we support like openssl or sha1dc. On many
systems, including Linux, these two are equivalent, but they are not on
Windows (where only a "long long" is 64 bits). As a result, passing
large chunks to a single the_hash_algo->update_fn() would produce wrong
answers there.

Note that we don't need to update any other sizes outside of the
function interface. We store the cumulative size in a "long long" (which
we must do since we hash things bigger than 4GB, like packfiles, even on
32-bit platforms). And internally, we break that size_t len down into
64-byte blocks to feed into the guts of the algorithm.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16 13:41:35 -08:00
Atousa Pahlevan Duprat 3bc72fde3f sha1: provide another level of indirection for the SHA-1 functions
The git source uses git_SHA1_Update() and friends to call into the
code that computes the hashes.  Traditionally, we used to map these
directly to underlying implementation of the SHA-1 hash (e.g.
SHA1_Update() from OpenSSL or blk_SHA1_Update() from block-sha1/).

This arrangement however makes it hard to tweak behaviour of the
underlying implementation without fully replacing.  If we want to
introduce a tweaked_SHA1_Update() wrapper to implement the "Update"
in a slightly different way, for example, the implementation of the
wrapper still would want to call into the underlying implementation,
but tweaked_SHA1_Update() cannot call git_SHA1_Update() to get to
the underlying implementation (often but not always SHA1_Update()).

Add another level of indirection that maps platform_SHA1_Update()
and friends to their underlying implementations, and by default make
git_SHA1_Update() and friends map to platform_SHA1_* functions.

Doing it this way will later allow us to map git_SHA1_Update() to
tweaked_SHA1_Update(), and the latter can use platform_SHA1_Update()
in its implementation.

Signed-off-by: Atousa Pahlevan Duprat <apahlevan@ieee.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-11-05 10:35:11 -08:00
Junio C Hamano 0f9e62e084 Merge branch 'jk/pack-bitmap'
Borrow the bitmap index into packfiles from JGit to speed up
enumeration of objects involved in a commit range without having to
fully traverse the history.

* jk/pack-bitmap: (26 commits)
  ewah: unconditionally ntohll ewah data
  ewah: support platforms that require aligned reads
  read-cache: use get_be32 instead of hand-rolled ntoh_l
  block-sha1: factor out get_be and put_be wrappers
  do not discard revindex when re-preparing packfiles
  pack-bitmap: implement optional name_hash cache
  t/perf: add tests for pack bitmaps
  t: add basic bitmap functionality tests
  count-objects: recognize .bitmap in garbage-checking
  repack: consider bitmaps when performing repacks
  repack: handle optional files created by pack-objects
  repack: turn exts array into array-of-struct
  repack: stop using magic number for ARRAY_SIZE(exts)
  pack-objects: implement bitmap writing
  rev-list: add bitmap mode to speed up object lists
  pack-objects: use bitmaps when packing objects
  pack-objects: split add_object_entry
  pack-bitmap: add support for bitmap indexes
  documentation: add documentation for the bitmap format
  ewah: compressed bitmap implementation
  ...
2014-02-27 14:01:48 -08:00
Jeff King 802b123366 block-sha1: factor out get_be and put_be wrappers
The BLK_SHA1 code has optimized wrappers for doing endian
conversions on memory that may not be aligned. Let's pull
them out so that we can use them elsewhere, especially the
time-tested list of platforms that prefer each strategy.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-01-23 14:03:21 -08:00
Junio C Hamano 6b2dd0e56b block-sha1/sha1.c: have SP around arithmetic operators
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-16 10:27:27 -07:00
Junio C Hamano ebcfa444c4 Merge branch 'jn/block-sha1'
The code to load a word one-byte-at-a-time was optimized into a
word-wide load instruction even when the pointer was not aligned,
which caused issues on architectures that do not like unaligned
access.

* jn/block-sha1:
  Makefile: BLK_SHA1 does not require fast htonl() and unaligned loads
  block-sha1: put expanded macro parameters in parentheses
  block-sha1: avoid pointer conversion that violates alignment constraints
2012-07-23 20:56:47 -07:00
Jonathan Nieder 23119ffb4e block-sha1: put expanded macro parameters in parentheses
't' is currently always a numeric constant, but it can't hurt to
prepare for the day that it becomes useful for a caller to pass in a
more complex expression.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-22 21:13:53 -07:00
Jonathan Nieder 5f6a11259a block-sha1: avoid pointer conversion that violates alignment constraints
With 660231aa (block-sha1: support for architectures with memory
alignment restrictions, 2009-08-12), blk_SHA1_Update was modified to
access 32-bit chunks of memory one byte at a time on arches that
prefer that:

	#define get_be32(p)    ( \
		(*((unsigned char *)(p) + 0) << 24) | \
		(*((unsigned char *)(p) + 1) << 16) | \
		(*((unsigned char *)(p) + 2) <<  8) | \
		(*((unsigned char *)(p) + 3) <<  0) )

The code previously accessed these values by just using htonl(*p).

Unfortunately, Michael noticed on an Alpha machine that git was using
plain 32-bit reads anyway.  As soon as we convert a pointer to int *,
the compiler can assume that the object pointed to is correctly
aligned as an int (C99 section 6.3.2.3 "pointer conversions"
paragraph 7), and gcc takes full advantage by using a single 32-bit
load, resulting in a whole bunch of unaligned access traps.

So we need to obey the alignment constraints even when only dealing
with pointers instead of actual values.  Do so by changing the type
of 'data' to void *.  This patch renames 'data' to 'block' at the same
time to make sure all references are updated to reflect the new type.

Reported-tested-and-explained-by: Michael Cree <mcree@orcon.net.nz>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-22 21:11:35 -07:00
Ramsay Jones 078e9bce1e msvc: Select the "fast" definition of the {get,put}_be32() macros
On Intel machines, the msvc compiler defines the CPU architecture
macros _M_IX86 and _M_X64 (equivalent to __i386__ and __x86_64__
respectively). Use these macros in the pre-processor expression
to select the "fast" definition of the {get,put}_be32() macros.

Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Acked-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-06-27 21:59:32 -07:00
Ramsay Jones 9eafa1201b msvc: Fix some compiler warnings
In particular, using the normal (or production) compiler
warning level (-W3), msvc complains as follows:

.../sha1.c(244) : warning C4018: '<' : signed/unsigned mismatch
.../sha1.c(270) : warning C4244: 'function' : conversion from \
   'unsigned __int64' to 'unsigned long', possible loss of data
.../sha1.c(271) : warning C4244: 'function' : conversion from \
   'unsigned __int64' to 'unsigned long', possible loss of data

Note that gcc issues a similar complaint about line 244 when
compiling with -Wextra.

Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-06-25 11:04:16 -07:00
Nicolas Pitre 30ae47b4cc remove ARM and Mozilla SHA1 implementations
They are both slower than the new BLK_SHA1 implementation, so it is
pointless to keep them around.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-18 14:19:40 -07:00
Nicolas Pitre e9c5dcd131 block-sha1: guard gcc extensions with __GNUC__
With this, the code should now be portable to any C compiler.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-18 14:18:36 -07:00
Nicolas Pitre 51ea55190b make sure byte swapping is optimal for git
We rely on ntohl() and htonl() to perform byte swapping in many places.
However, some platforms have libraries providing really poor
implementations of those which might cause significant performance
issues, especially with the block-sha1 code.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-18 14:16:37 -07:00
Nicolas Pitre d5f6a96fa4 block-sha1: make the size member first in the context struct
This is a 64-bit value, hence having it first provides a better
alignment.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-18 12:26:01 -07:00
Brandon Casey a12218572f block-sha1/sha1.c: silence compiler complaints by casting void * to char *
Some compilers produce errors when arithmetic is attempted on pointers to
void.  We want computations done on byte addresses, so cast them to char *
to work them around.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-14 19:13:00 -07:00
Nicolas Pitre ee7dc310af block-sha1: more good unaligned memory access candidates
In addition to X86, PowerPC and S390 are capable of unaligned memory
accesses.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-13 10:41:02 -07:00
Nicolas Pitre 660231aa97 block-sha1: support for architectures with memory alignment restrictions
This is needed on architectures with poor or non-existent unaligned memory
support and/or no fast byte swap instruction (such as ARM) by using byte
accesses to memory and shifting the result together.

This also makes the code portable, therefore the byte access methods are
the defaults.  Any architecture that properly supports unaligned word
accesses in hardware simply has to enable the alternative methods.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:36:32 -07:00
Nicolas Pitre dc52fd2973 block-sha1: split the different "hacks" to be individually selected
This is to make it easier for them to be selected individually depending
on the architecture instead of the other way around i.e. having each
architecture select a list of hacks up front.  That makes for clearer
documentation as well.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:35:54 -07:00
Nicolas Pitre 30ba0de726 block-sha1: move code around
Move the code around so specific architecture hacks are defined first.
Also make one line comments actually one line.  No code change.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-12 13:32:54 -07:00
Linus Torvalds 926172c5e4 block-sha1: improve code on large-register-set machines
For x86 performance (especially in 32-bit mode) I added that hack to write
the SHA1 internal temporary hash using a volatile pointer, in order to get
gcc to not try to cache the array contents. Because gcc will do all the
wrong things, and then spill things in insane random ways.

But on architectures like PPC, where you have 32 registers, it's actually
perfectly reasonable to put the whole temporary array[] into the register
set, and gcc can do so.

So make the 'volatile unsigned int *' cast be dependent on a
SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on just
x86 and x86-64.  With that, the routine is fairly reasonable even when
compared to the hand-scheduled PPC version. Ben Herrenschmidt reports on
a G5:

 * Paulus asm version:       about 3.67s
 * Yours with no change:     about 5.74s
 * Yours without "volatile": about 3.78s

so with this the C version is within about 3% of the asm one.

And add a lot of commentary on what the heck is going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-10 17:26:51 -07:00
Linus Torvalds 66c9c6c0fb block-sha1: improved SHA1 hashing
I think I have found a way to avoid the gcc crazyness.

Lookie here:

	#             TIME[s] SPEED[MB/s]
	rfc3174         5.094       119.8
	rfc3174         5.098       119.7
	linus           1.462       417.5
	linusas         2.008         304
	linusas2        1.878         325
	mozilla         5.566       109.6
	mozillaas       5.866       104.1
	openssl         1.609       379.3
	spelvin         1.675       364.5
	spelvina        1.601       381.3
	nettle          1.591       383.6

notice? I outperform all the hand-tuned asm on 32-bit too. By quite a
margin, in fact.

Now, I didn't try a P4, and it's possible that it won't do that there, but
the 32-bit code generation sure looks impressive on my Nehalem box. The
magic? I force the stores to the 512-bit hash bucket to be done in order.
That seems to help a lot.

The diff is trivial (on top of the "rename registers with cpp" patch), as
appended. And it does seem to fix the P4 issues too, although I can
obviously (once again) only test Prescott, and only in 64-bit mode:

	#             TIME[s] SPEED[MB/s]
	rfc3174         1.662       36.73
	rfc3174          1.64       37.22
	linus          0.2523       241.9
	linusas        0.4367       139.8
	linusas2       0.4487         136
	mozilla        0.9704        62.9
	mozillaas      0.9399       64.94

that's some really impressive improvement. All from just saying "do the
stores in the order I told you to, dammit!" to the compiler.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-07 22:32:46 -07:00
Linus Torvalds 30d12d4c16 block-sha1: perform register rotation using cpp
Instead of letting the compiler to figure out the optimal way to rotate
register usage, explicitly rotate the register names with cpp.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-07 22:19:45 -07:00
Linus Torvalds 5d5210c35a block-sha1: get rid of redundant 'lenW' context
.. and simplify the ctx->size logic.

We now count the size in bytes, which means that 'lenW' was always just
the low 6 bits of the total size, so we don't carry it around separately
any more.  And we do the 'size in bits' shift at the end.

Suggested by Nicolas Pitre and linux@horizon.com.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds e869e113c8 block-sha1: Use '(B&C)+(D&(B^C))' instead of '(B&C)|(D&(B|C))' in round 3
It's an equivalent expression, but the '+' gives us some freedom in
instruction selection (for example, we can use 'lea' rather than 'add'),
and associates with the other additions around it to give some minor
scheduling freedom.

Suggested-by: linux@horizon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds ab14c823df block-sha1: macroize the rounds a bit further
Avoid repeating the shared parts of the different rounds by adding a
macro layer or two. It was already more cpp than C.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds 7b5075fcfb block-sha1: re-use the temporary array as we calculate the SHA1
The mozilla-SHA1 code did this 80-word array for the 80 iterations.  But
the SHA1 state is really just 512 bits, and you can actually keep it in
a kind of "circular queue" of just 16 words instead.

This requires us to do the xor updates as we go along (rather than as a
pre-phase), but that's really what we want to do anyway.

This gets me really close to the OpenSSL performance on my Nehalem.
Look ma, all C code (ok, there's the rol/ror hack, but that one doesn't
strictly even matter on my Nehalem, it's just a local optimization).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds 139e3456ec block-sha1: make the 'ntohl()' part of the first SHA1 loop
This helps a teeny bit.  But what I -really- want to do is to avoid the
whole 80-array loop, and do the xor updates as I go along..

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Junio C Hamano fd536d3439 block-sha1: minor fixups
Bert Wesarg noticed non-x86 version of SHA_ROT() had a typo.
Also spell in-line assembly as __asm__(), otherwise I seem to get
error: implicit declaration of function 'asm' from my compiler.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Linus Torvalds b8e48a89b8 block-sha1: try to use rol/ror appropriately
Use the one with the smaller constant.  It _can_ generate slightly
smaller code (a constant of 1 is special), but perhaps more importantly
it's possibly faster on any uarch that does a rotate with a loop.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:45 -07:00
Junio C Hamano b26a9d5089 block-sha1: undo ctx->size change
Undo the change I picked up from the mailing list discussion suggested
by Nico, not because it is wrong, but it will be done at the end of the
follow-up series.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-06 13:56:19 -07:00
Linus Torvalds d7c208a92e Add new optimized C 'block-sha1' routines
Based on the mozilla SHA1 routine, but doing the input data accesses a
word at a time and with 'htonl()' instead of loading bytes and shifting.

It requires an architecture that is ok with unaligned 32-bit loads and a
fast htonl().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-05 19:28:21 -07:00