
I'm processing a lot of Git repositories with software I wrote. The basic process is like this (sketched in shell after the list):

  1. Run git clone --no-checkout --filter=blob:none <url>
  2. Read interesting data from the repository and put it in a database.
  3. Later, run git fetch --prune --prune-tags --force to pick up the newest changes.
  4. Read all data from the repository that is not yet in the database.
  5. Go to step 3.
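
In shell terms, one cycle looks roughly like this (a sketch; <url> is each repository's URL, repo is a placeholder directory name, and the database step is my own software):

# one-time setup: a blobless clone with no working tree
git clone --no-checkout --filter=blob:none <url> repo

# each update cycle
git -C repo fetch --prune --prune-tags --force
# ...then read the commits/trees/tags that are not yet in the database...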

This works, but since I'm processing thousands of repos, the repositories still take up a lot of disk space (more than a terabyte), even though I use --filter=blob:none. And I don't need that data: once I have processed a Git object (commit, tree, or tag), I don't need it anymore.

Is there a way to delete all, or most, of the objects in the repository while keeping the ability to fetch changes? And to avoid fetching those objects again?

I've looked at shallow clones, promisor files, and replace references, but it's all very complicated, and every command/option seems to do something just a little bit different from what I need.

One option: rm -rf the repo, then clone with --filter=blob:none again. Another option: keep only a copy of the current packed-refs, and use that to mark all of those commits as "shallow" (--depth=1) on the next update.
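
A sketch of that first option (assuming the clone lives in a directory named repo):

# make sure every ref is in packed-refs, then keep a copy of the processed tips
git -C repo pack-refs --all
cp repo/.git/packed-refs packed-refs.old
rm -rf repo
git clone --no-checkout --filter=blob:none <url> repo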

1 Answer


Git doesn't really provide a command to delete arbitrary objects from the object store. At most, you can use git gc to remove dangling (unreachable) objects, but that doesn't apply here, since your objects are still reachable.

In your scenario, once you've processed all the data, you could keep track of the current tip commit of each ref, delete the whole repository, and then clone it again with the option --depth=1. Since --depth implies --single-branch, the option --no-single-branch is necessary to fetch the histories near the tips of all branches.

git clone --no-checkout --filter=blob:none --no-single-branch --depth=1 <url>
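
For example (a sketch; repo and tips.txt are placeholder names):

# record the tip of every ref before deleting the repository
git -C repo for-each-ref --format='%(objectname) %(refname)' > tips.txt
rm -rf repo
git clone --no-checkout --filter=blob:none --no-single-branch --depth=1 <url> repo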

However, after the second clone, make sure that the head commit of each branch corresponds to the last tip you've processed. If the tips don't correspond, you can force-fetch until the previous head is included in the object store, increasing the depth by a certain amount at each iteration.

git fetch --force --depth=<n> origin <branch>
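
One possible loop, doubling the depth until the previously processed head is present (a sketch; <prev-commit> and <branch> are placeholders, and GIT_NO_LAZY_FETCH requires a recent Git version; without it, the partial clone may lazily fetch the missing commit and defeat the check):

depth=1
until GIT_NO_LAZY_FETCH=1 git cat-file -e <prev-commit> 2>/dev/null; do
  depth=$((depth * 2))
  git fetch --force --depth=$depth origin <branch>
done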

5 Comments

The other branches and tags are also included in the clone, which is what I want. In the new clone, I should specify --shallow-exclude for each commit that was referenced by a branch/tag of the last clone, right?
Does this also prevent fetching tree objects that had already been fetched in a previous clone?
No, in that case you need to specify the commit obtained from git merge-base of all the commits you're interested in (all refs and tags).
@user42723 you would download the same tree objects again, but only up to the merge-base of all the refs you had.
@user42723 I've updated my answer. My previous solution couldn't work, as --shallow-exclude accepts only a ref. Therefore, the commit id returned by git merge-base would result in an error.
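
For reference, the merge base the comments discuss could be computed like this (a sketch; as the last comment notes, the resulting commit id cannot be passed to --shallow-exclude, which accepts only a ref):

# common ancestor of every branch and tag tip
git merge-base --octopus $(git for-each-ref --format='%(objectname)' refs/heads refs/tags)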
