I'm processing a lot of Git repositories with software I wrote. The basic process is like this:
1. Do `git clone --no-checkout --filter=blob:none <url>`.
2. Read interesting data from the repository and put it in a database.
3. Later, do `git fetch --prune --prune-tags --force` to update the data with the newest changes.
4. Read all data from the repository that is not yet in the database.
5. Go to 3.
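For context, here is a minimal sketch of that loop as a shell script. `URL`, `DIR`, and the `process_repo` step are hypothetical placeholders standing in for my actual tooling:

```sh
#!/usr/bin/env sh
URL=$1
DIR=$2

# Step 1: blob-less clone with no working tree (runs once per repository).
[ -d "$DIR" ] || git clone --no-checkout --filter=blob:none "$URL" "$DIR"
cd "$DIR" || exit 1

# Step 2: initial read of commits, trees and tags into the database
# (process_repo is a hypothetical stand-in for that extraction step).
process_repo "$DIR"

# Steps 3-5: repeatedly fetch upstream changes and read whatever is new.
while true; do
    git fetch --prune --prune-tags --force
    process_repo "$DIR"
    sleep 3600   # arbitrary polling interval
done
```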
This works, but since I'm processing thousands of repos, the repositories still take up a lot of disk space (more than a terabyte), even though I use `--filter=blob:none`. And I don't need that data: once I have processed a Git object (commit, tree, or tag), I don't need it anymore.
Is there a way to delete all, or most, of the objects in a repository while keeping the ability to fetch changes, and without having to fetch those objects again?
I've looked at shallow clones, promisor files, and replace references, but it's all very complicated, and every command/option seems to do something that is just a little bit different from what I need.
`rm -rf repo`, then clone with `blob:none` again. Another option: you may also keep only a copy of the current `packed-refs` file, and use it to mark all of those commits as "shallow" (`--depth=1`) on the next update.
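A rough sketch of both options, not a tested recipe. The second one hand-edits `.git/shallow` (a file Git reads, containing one commit hash per line whose parents are then treated as absent); `$url` and `$dir` are placeholders:

```sh
# Option 1: throw the object store away and re-clone blob-less.
# Simple, but re-downloads all commits and trees each time.
rm -rf "$dir"
git clone --no-checkout --filter=blob:none "$url" "$dir"

# Option 2 (sketch, touches repository internals): mark the current ref
# tips as shallow boundaries so everything behind them can be pruned,
# then fetch later updates with --depth=1.
cd "$dir"
# Annotated tags would need peeling to commits first, so only branch
# tips are listed here.
git for-each-ref --format='%(objectname)' refs/heads refs/remotes > .git/shallow
git reflog expire --expire=now --all    # drop reflog entries pinning old history
git gc --prune=now                      # delete the now-unreachable objects
# Later updates only bring in commits newer than the recorded tips:
git fetch --prune --prune-tags --force --depth=1
```

Whether `gc` actually drops the pre-boundary objects depends on nothing else still referencing them, which is why the reflog is expired first.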