I'm processing a lot of Git repositories with software I wrote. The basic process is like this:
1. Do `git clone --no-checkout --filter=blob:none <url>`.
2. Read interesting data from the repository and put it in a database.
3. Later, do `git fetch --prune --prune-tags --force` to update the data with the newest changes.
4. Read all data from the repository that is not yet in the database.
5. Go to 3.
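For context, here is a minimal sketch of that loop as a shell script. `URL`, `DIR`, and the `process_repo` step are hypothetical placeholders standing in for my actual tooling:

```sh
#!/usr/bin/env sh
URL=$1
DIR=$2

# Step 1: blob-less clone with no working tree (runs once per repository).
[ -d "$DIR" ] || git clone --no-checkout --filter=blob:none "$URL" "$DIR"
cd "$DIR" || exit 1

# Step 2: initial read of commits, trees and tags into the database
# (process_repo is a hypothetical stand-in for that extraction step).
process_repo "$DIR"

# Steps 3-5: repeatedly fetch upstream changes and read whatever is new.
while true; do
    git fetch --prune --prune-tags --force
    process_repo "$DIR"
    sleep 3600   # arbitrary polling interval
done
```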
This works, but since I'm processing thousands of repos, the repositories still take up a lot of disk space (more than a terabyte), even though I use `--filter=blob:none`. And I don't need that data: once I have processed a Git object (commit, tree, or tag), I don't need it anymore.
Is there a way to delete all, or most, of the objects in a repository while keeping the ability to fetch changes, and without having to fetch those objects again?
I've looked at shallow clones, promisor files, and replace references, but it's all very complicated, and every command/option seems to do something that is just a little bit different from what I need.
`rm -rf repo`, then clone with `blob:none` again. Another option: you may also keep only a copy of the current `packed-refs` file, and use it to mark all of those commits as "shallow" (`--depth=1`) on the next update.
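A rough sketch of both options, not a tested recipe. The second one hand-edits `.git/shallow` (a file Git reads, containing one commit hash per line whose parents are then treated as absent); `$url` and `$dir` are placeholders:

```sh
# Option 1: throw the object store away and re-clone blob-less.
# Simple, but re-downloads all commits and trees each time.
rm -rf "$dir"
git clone --no-checkout --filter=blob:none "$url" "$dir"

# Option 2 (sketch, touches repository internals): mark the current ref
# tips as shallow boundaries so everything behind them can be pruned,
# then fetch later updates with --depth=1.
cd "$dir"
# Annotated tags would need peeling to commits first, so only branch
# tips are listed here.
git for-each-ref --format='%(objectname)' refs/heads refs/remotes > .git/shallow
git reflog expire --expire=now --all    # drop reflog entries pinning old history
git gc --prune=now                      # delete the now-unreachable objects
# Later updates only bring in commits newer than the recorded tips:
git fetch --prune --prune-tags --force --depth=1
```

Whether `gc` actually drops the pre-boundary objects depends on nothing else still referencing them, which is why the reflog is expired first.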