
Here is my pre-commit hook (found on Stack Overflow):

#!/bin/bash

# taken from: https://stackoverflow.com/questions/39576257/how-to-limit-file-size-on-commit
hard_limit=$(git config hooks.filesizehardlimit)
soft_limit=$(git config hooks.filesizesoftlimit)
: ${hard_limit:=1000000}
: ${soft_limit:=500000}

status=0

bytesToHuman() {
  b=${1:-0}; d=''; s=0; S=({,K,M,G,T,P,E,Z,Y}B)
  while ((b > 1000)); do
    d="$(printf ".%01d" $((b % 1000 * 10 / 1000)))"
    b=$((b / 1000))
    let s++
  done
  echo "$b$d${S[$s]}"
}

# Iterate over the zero-delimited list of staged files.
while IFS= read -r -d '' file ; do
  hash=$(git ls-files -s "$file" | cut -d ' ' -f 2)
  size=$(git cat-file -s "$hash")
  # For testing: replacing the two lookups above with these constants
  # makes the hook fast again (used to isolate the slow commands).
  #hash=0
  #size=0
  if (( $size > $hard_limit )); then
    echo "Error: Cannot commit '$file' because it is $(bytesToHuman $size), which exceeds the hard size limit of $(bytesToHuman $hard_limit)."
    status=1
  elif (( $size > $soft_limit )); then
    echo "Warning: '$file' is $(bytesToHuman $size), which exceeds the soft size limit of $(bytesToHuman $soft_limit). Please double check that you intended to commit this file."
  fi
done < <(git diff -z --staged --name-only --diff-filter=d)
exit $status

It works great, but now my commits are very slow.

Without the hook, or if I replace the "hash" (git ls-files) and "size" (git cat-file) lookups with the commented-out constants, a commit takes 3 seconds. With the complete script, it takes 300 seconds.

I have 1000 files in the commit. Is it normal to see such latency with this hook? My computer is quite recent (Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz, 32 GB RAM) and I don't have performance problems elsewhere.
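For scale: with 1000 staged files this hook launches roughly 2000 short-lived git processes (one git ls-files and one git cat-file per file). A quick way to gauge per-process overhead on a given machine is a micro-benchmark along these lines (hypothetical, not part of the hook):

# Time 100 one-shot git invocations; if this alone takes seconds,
# process startup is the bottleneck.
time for i in $(seq 100); do git cat-file -s HEAD >/dev/null; done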

Comments:

  • Not a git problem: git only runs whatever the hook has inside; if the hook is slow, you need to work on the hook to make it faster. IOW, I would remove the git tag.
  • Is this on Windows? When you say "I have 1000 files in the commit", do you mean 1000 modified files or 1000 files in total in the repository?
  • First thing to do is use git diff --raw and extract the hashes from there, then pass the list of hashes to git cat-file --batch-check="%(objectname) %(objectsize)" to find the object sizes. Filter the large ones. So far there's no loop needed. From there it is a bit more tedious to work back to the file names, but you only have to do that when there is an error to report, not during normal operation. (A sketch follows after these comments.)
  • Those commands don't take long. They certainly don't on Linux. It sounds like the problem is with running many processes on your system.
  • Re @j6t's method: yep. I'd carry the pathnames from the raw diff; git diff-index -z --cached --diff-filter=d @ | awk '{id=$4;getline;print id,$0}' RS='\0' | git cat-file --batch-check='%(objectsize) %(rest)' will do fine in the absence of actual newlines in your paths.
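
A minimal sketch of the batched approach described in the comments above (assuming, as the comments do, that the new object id is the fourth field of the raw diff output; the limit value is illustrative):

#!/bin/bash
# Sketch: one git diff plus one git cat-file for all staged files,
# instead of two git processes per file.
git diff --raw --staged --diff-filter=d \
  | cut -d' ' -f4 \
  | git cat-file --batch-check='%(objectname) %(objectsize)' \
  | awk -v limit=1000000 '$2 > limit {print $1, $2}'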

1 Answer


Thanks for all your remarks. Here is the resulting script; it is very quick even for 1000 files (a few seconds).

Some remarks:

  • -n was missing from the sort options, so sizes were compared as strings instead of numbers.

  • I don't convert the sizes in the file listings to a "human" format. I suspect that, because of the pipes, the bytesToHuman function cannot be used directly there (a possible workaround is sketched below).
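
If human-readable sizes are wanted in the listings anyway, one workaround is a small wrapper around bytesToHuman; this is a sketch that assumes each pipeline line is "size path" and that paths contain no newlines:

# Hypothetical helper: prettify "SIZE PATH" lines from the cat-file pipeline.
# bytesToHuman is the function defined in the script below.
humanizeLines() {
  while read -r size path; do
    echo "$(bytesToHuman "$size") $path"
  done
}
# Usage sketch: ... | awk -v hardlimit="$hard_limit" '$1 > hardlimit' | humanizeLines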

#!/bin/bash

hard_limit=$(git config hooks.filesizehardlimit)
soft_limit=$(git config hooks.filesizesoftlimit)
: ${hard_limit:=1000000}
: ${soft_limit:=500000}

status=0

bytesToHuman() {
  b=${1:-0}; d=''; s=0; S=({,K,M,G,T,P,E,Z,Y}B)
  while ((b > 1000)); do
    d="$(printf ".%01d" $((b % 1000 * 10 / 1000)))"
    b=$((b / 1000))
    let s++
  done
  echo "$b$d${S[$s]}"
}


# Faster check: one batched git pipeline, no loop over all files.
# Files managed by Git LFS are effectively ignored: from Git's point of view
# they are tiny pointer files referencing the LFS server, so their object
# size is very small (see the note after the script).
max=$(git diff --raw --staged --diff-filter=d \
    | cut -d' ' -f4 \
    | git cat-file --batch-check="%(objectname) %(objectsize)" \
    | cut -d' ' -f2 \
    | sort --reverse -n \
    | head -1)
: "${max:=0}"  # guard: empty when nothing is staged
echo "Largest staged object: $max bytes"

# If the maximum size is above the hard or soft limit, we need to display some files;
# if not, nothing more is done (fast execution).
if (( max > hard_limit || max > soft_limit ))
then
    if (( max > hard_limit ))
    then
        echo "Error: Cannot commit because at least one file exceeds the hard size limit of $(bytesToHuman $hard_limit)."
        status=1
        echo "Error: List of files over the hard limit:"
        git diff-index -z --cached --diff-filter=d @ | awk '{id=$4;getline;print id,$0}' RS='\0' | git cat-file --batch-check='%(objectsize) %(rest)' | awk -v hardlimit="$hard_limit" '$1 > hardlimit' | while read -r line; do
            echo "$line"
        done
    fi
    echo "Warning: List of files over the soft limit (still allowed):"
    git diff-index -z --cached --diff-filter=d @ | awk '{id=$4;getline;print id,$0}' RS='\0' | git cat-file --batch-check='%(objectsize) %(rest)' | awk -v softlimit="$soft_limit" -v hardlimit="$hard_limit" '$1 > softlimit && $1 <= hardlimit' | while read -r line; do
        echo "$line"
    done
fi

exit $status
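
A note on the LFS remark in the script's comments: for an LFS-tracked file, what actually gets staged is a small pointer, which is why its object size stays far below the limits. A sketch (the path is hypothetical):

# Print the staged version of an LFS-tracked file: it is only a pointer.
# The ':path' syntax asks for the staged (index) version of the file.
git cat-file -p :assets/big-video.bin
# version https://git-lfs.github.com/spec/v1
# oid sha256:<64-hex-digest>
# size 104857600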

Comments

The -z in the git diff-index command has two effects: it separates the output lines with NUL bytes instead of newline characters, and it does not quote file names with special characters. But it also makes the while read loop eat and ignore the input. I suggest replacing the loop with tr '\0' '\012' to turn the NUL characters back into newline characters.
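
Applied to the commands above, the suggestion is along these lines (a sketch; it assumes no actual newlines in path names):

# Sketch: turn the NUL-separated -z output into ordinary lines with tr
# instead of consuming it in a while read loop.
git diff-index -z --cached --diff-filter=d @ | tr '\0' '\012'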
Thanks for helping. I don't understand the drawback of the current solution, though. If a path name contains a newline character, it won't work? Is it even possible to have a newline character in a path name?
The Git database allows every byte in path names except the NUL byte. A carefully written script should permit this, too. That said, I wouldn't blame you if you ignored the possibility of newlines in path names in your personal scripts, in particular, when you are on Windows, where they are forbidden.
Try git diff --raw -z HEAD~ and notice the NUL bytes in the stream. Then try git diff --raw -z HEAD~ | while read -r line; do echo "$line"; done and notice that there is no output. Finally, try git diff --raw -z HEAD~ | tr '\0' '\012' and see output again. Take this as a hint that the script as written will not write file names in the error case.
I tried git diff-index --cached --diff-filter=d @ without the -z option and I see the separate lines. Using the -z option I get one big line, but I don't "see" the NUL bytes, whereas with git diff --raw -z HEAD~ I see a strange character, "^@", that seems to replace the tabs and newlines. I don't understand the difference in behavior when I use the -z option with the same command (seeing ^@ vs. not seeing ^@).
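
To see the NUL bytes regardless of what the terminal or pager renders, one option is to pipe the output through od (a quick sketch):

# NUL bytes appear as \0 in od's character dump; tabs appear as \t.
git diff-index -z --cached --diff-filter=d @ | od -c | head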
