530

I was working on a repository on my GitHub account and this is a problem I stumbled upon.

  • Node.js project with a folder with a few npm packages installed
  • The packages were in node_modules folder
  • Added that folder to Git repository and pushed the code to GitHub (wasn't thinking about the npm part at that time)
  • Realized that you don't really need that folder to be a part of the code
  • Deleted that folder, pushed it

At that instance, the size of the total Git repository was around 6 MB where the actual code (all except that folder) was only around 300 KB.

Now what I am looking for in the end is a way to get rid of details of that package folder from Git's history, so if someone clones it, they don't have to download 6 MB worth of history where the only actual files they will be getting as of the last commit would be 300 KB.

I looked up possible solutions for this and tried these two methods

The Gist seemed like it worked where after running the script, it showed that it got rid of that folder and after that it showed that 50 different commits were modified. But it didn't let me push that code. When I tried to push it, it said Branch up to date, but it showed 50 commits were modified upon a git status. The other two methods didn't help either.

Now even though it showed that it got rid of that folder's history, when I checked the size of that repository on my local machine, it was still around 6 MB. (I also deleted the refs/original folder but didn't see any change in the size of the repository.)

What I am looking to clarify is whether there's a way to get rid of not only the commit history (which is the only thing I think happened) but also the files Git keeps in case one wants to roll back.

Suppose a solution works on my local machine but can't be reproduced on that GitHub repository. Is it possible to clone that repository, roll back to the first commit, perform the trick and push it? Or does that mean that Git will still have a history of all those commits (that is, the 6 MB)?

My end goal here is to basically find the best way to get rid of the folder contents from Git so that a user doesn't have to download 6MB worth of stuff and still possibly have the other commits that never touched the modules folder (that's pretty much all of them) in Git's history.

How can I do this?

11 Answers

708

WARNING: git filter-branch is no longer officially recommended. The official recommendation is to use git-filter-repo; see André Anjos' answer for details.


If you are here to copy-paste code:

This is an example which removes node_modules from history

git filter-branch --tree-filter "rm -rf node_modules" --prune-empty HEAD
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
echo node_modules/ >> .gitignore
git add .gitignore
git commit -m 'Removing node_modules from git history'
git gc
git push origin main --force

What Git actually does:

The first line rewrites every commit reachable from HEAD (your current branch). For each commit, --tree-filter checks out that commit's tree and runs rm -rf node_modules in it, then records the result as the rewritten commit. The -r flag makes rm delete the folder recursively (without -r, rm won't delete folders), and -f suppresses prompts and errors when the folder is missing. The added --prune-empty drops commits that no longer change anything once the folder is gone.

The second line deletes the backup references that filter-branch saves under refs/original/, which otherwise keep the old commits alive.

The rest of the commands are relatively straightforward.
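To check that the rewrite really removed the folder, you can try the same commands on a throwaway repository first. This is only a sketch: the temporary directory, the file names and the demo identity are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set just to skip filter-branch's deprecation notice and delay.

```shell
set -eu

# Throwaway repository (hypothetical demo data), so the real one is never touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1: code plus a committed node_modules folder.
mkdir -p node_modules/pkg
echo 'console.log("hi")' > index.js
echo 'module.exports = {}' > node_modules/pkg/index.js
git add .
git commit -q -m "Add code and node_modules"

# Commit 2: delete the folder, as described in the question.
git rm -r -q node_modules
git commit -q -m "Remove node_modules"

# Rewrite every commit reachable from HEAD.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter "rm -rf node_modules" --prune-empty HEAD

# Drop the backup refs that filter-branch keeps under refs/original/.
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Count commits that still touch node_modules, and the commits left on the branch.
remaining=$(git log --all --oneline -- node_modules | wc -l | tr -d ' ')
commits=$(git rev-list --count HEAD)
echo "commits touching node_modules: $remaining, total commits: $commits"
```

If the first number is not zero, some ref (a tag, another branch, or a leftover backup ref) still points at the old history.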

24 Comments

Just a side note: I used git count-objects -v to check whether the files were actually removed, but the size of the repository remained the same until I cloned the repository again. I think Git maintains a copy of all the original files.
With a non-ancient git, this should probably read --force-with-lease, not --force.
None of these commands work on Windows, or at least not on Windows 10. Please post the OS that the "cut and paste" works on.
For Windows 10 users, this works nicely under Bash for Windows (I used Ubuntu)
I tried it with the Windows shell and with Git Bash, and it did not work. The first command passes, the second command fails!
336

I find that the --tree-filter option used in other answers can be very slow, especially on larger repositories with lots of commits.

Here is the method I use to completely remove a directory from the Git history using the --index-filter option, which runs much quicker:

# Make a fresh clone of YOUR_REPO
git clone YOUR_REPO
cd YOUR_REPO

# Create tracking branches of all branches
for remote in `git branch -r | grep -v /HEAD`; do git checkout --track $remote ; done

# Remove DIRECTORY_NAME from all commits, then remove the refs to the old commits
# (repeat these two commands for as many directories that you want to remove)
git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch DIRECTORY_NAME/' --prune-empty --tag-name-filter cat -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Ensure all old refs are fully removed
rm -Rf .git/logs .git/refs/original

# Perform a garbage collection to remove commits with no refs
git gc --prune=all --aggressive

# Force push all branches to overwrite their history
# (use with caution!)
git push origin --all --force
git push origin --tags --force

You can check the size of the repository before and after the gc with:

git count-objects -vH
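
To see what is actually taking up the space before rewriting anything, it can help to list the largest blobs stored anywhere in history. A minimal sketch, assuming a POSIX shell; largest_blobs is a hypothetical helper name, not a Git command:

```shell
# List the N largest blobs reachable from any ref, biggest first.
# Output format: "<size in bytes> <path>".
# Usage: largest_blobs [N]   (run inside the repository; N defaults to 10)
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -rn |
    head -n "${1:-10}"
}
```

Running largest_blobs 20 before and after the rewrite should show whether the files from the removed directory are still reachable from some ref.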

20 Comments

could you explain why this is much faster?
@knocte: from the docs (git-scm.com/docs/git-filter-branch). "--index-filter: ... is similar to the tree filter but does not check out the tree, which makes it much faster"
Why is this not the accepted answer? It is so thorough.
If doing this in Windows, you need double quotes instead of single quotes.
Passing --quiet to the git rm above sped up my rewrite at least by factor 4.
250

It appears that the up-to-date answer to this is to not use filter-branch directly (at least Git itself does not recommend it anymore), and defer that work to an external tool. In particular, git-filter-repo is currently recommended. The author of that tool provides arguments on why using filter-branch directly can lead to issues.

Most of the multi-line scripts above to remove dir from the history could be rewritten as:

git-filter-repo --path dir --invert-paths

The tool is more powerful than just that, apparently. You can apply filters by author, email, refname and more (full man page here). Furthermore, it is fast. Installation is easy; it is distributed in a variety of formats.
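
As a rough sketch of the whole workflow, assuming git-filter-repo is installed and on your PATH: the function name remove_dir_from_history and its three arguments are made up for illustration. It works on a fresh clone because filter-repo refuses to rewrite anything else unless given --force, and it re-adds origin because filter-repo removes that remote after rewriting.

```shell
# Remove one directory from every commit, working on a fresh clone.
# Usage: remove_dir_from_history <repo-url> <directory> <destination>
# The body runs in a subshell so the caller's working directory is unchanged.
remove_dir_from_history() (
  set -e
  url=$1
  dir=$2
  dest=$3

  # --no-local only matters when <repo-url> is a path on the same machine.
  git clone -q --no-local "$url" "$dest"
  cd "$dest"

  # Keep everything except <directory>.
  git filter-repo --path "$dir" --invert-paths

  # filter-repo dropped the "origin" remote; restore it for the later push.
  git remote add origin "$url"
  echo "Review $dest, then run: git push origin --force --all"
)
```

The push is left as a manual step on purpose, since a force push rewrites history for everyone who has cloned the repository.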

16 Comments

Nice tool! Works well on Ubuntu 20.04, you can just pip3 install git-filter-repo since it's stdlib-only and doesn't install any dependencies. On Ubuntu 18 it's incompatible with the distro's git version (Error: need a version of git whose diff-tree command has the --combined-all-paths option), but it's easy enough to run it in a docker run -ti ubuntu:20.04
git: 'filter-repo' is not a git command. See 'git --help'.
Thanks for this, this was fast and finished in seconds! A couple of notes on usage: 1) You may need to install a newer version of git. If you're on Ubuntu that may require setting up a new apt repository, as e.g. the Xenial repos are still on git 2.7.4, which is too old. 2) This DOES delete the folder locally as well. Back it up if you need it. 3) You'll need to re-add the remote url and do a force push (as always, carefully!). 4) You can install the tool with pip3 easily (mentioned above). 5) You may need to run with --force if you don't want to clone a fresh repo. Seems to have gone fine for me.
The example should read git-filter-repo.py, not git filter-repo. It is not a native Git command.
On OS X, it is available through Homebrew: brew install git-filter-repo
61

In addition to Mohsen's popular answer, I would like to add a few notes for Windows systems. The command

git filter-branch --tree-filter 'rm -rf node_modules' --prune-empty HEAD
  • works perfectly without any modification! Therefore, do not use Remove-Item, del or anything else instead of rm -rf.

  • If you need to specify a path to a file or directory, use forward slashes, as in ./path/to/node_modules

4 Comments

This will not work on Windows if the directory contains a . (dot) in the name.
And I found the solution. Use double inverted-commas for rm command like this: "rm -rf node.modules".
@CorneliuSerediuc bro just say quotation marks
this man really just called quote marks as double inverted commas
35

The best and most accurate method I found was to download the bfg.jar file: https://rtyley.github.io/bfg-repo-cleaner/

Then run the commands:

git clone --bare https://project/repository project-repository
cd project-repository
java -jar bfg.jar --delete-folders DIRECTORY_NAME
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --mirror https://project/new-repository

If you want to delete files then use the delete-files option instead:

java -jar bfg.jar --delete-files "*.pyc"

2 Comments

Very easy :) If you want to make sure that only a specific folder is removed, this will help: stackoverflow.com/questions/21142986/…
But BFG may have trouble when several folders have the same name as the one you want to delete, i.e., BFG cannot accept a path name for --delete-folders.
8

Complete copy&paste recipe, just adding the commands in the comments (for the copy-paste solution), after testing them:

git filter-branch --tree-filter 'rm -rf node_modules' --prune-empty HEAD
echo node_modules/ >> .gitignore
git add .gitignore
git commit -m 'Removing node_modules from git history'
git gc
git push origin master --force

After this, you can remove the line "node_modules/" from .gitignore

1 Comment

Second the question... "After this, you can remove the line "node_modules/" from .gitignore" This line in the answer (answer... not git commit message) says you can remove node_modules/... but why would you?
8

For Windows users, please note that you should use " instead of '. I also added -f to force the command if another backup is already there.

git filter-branch -f --tree-filter "rm -rf FOLDERNAME" --prune-empty HEAD
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
echo FOLDERNAME/ >> .gitignore
git add .gitignore
git commit -m "Removing FOLDERNAME from git history"
git gc
git push origin master --force

2

I removed the bin and obj folders from old C# projects using Git on Windows. Be careful with

git filter-branch --tree-filter "rm -rf bin" --prune-empty HEAD

It destroys the integrity of the Git installation by deleting the usr/bin folder in the Git install folder.
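
One way to avoid running rm against the file system at all is the --index-filter approach from the answer above, which only edits Git's index. I have not reproduced the usr/bin problem myself, so treat this as a sketch on a throwaway repository rather than a confirmed fix; the directory layout, file names and demo identity are invented for the example.

```shell
set -eu

# Throwaway repository (hypothetical demo data).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# One commit containing source code plus build output.
mkdir bin obj src
echo 'binary' > bin/app.dll
echo 'cache' > obj/app.cache
echo 'class Program {}' > src/Program.cs
git add .
git commit -q -m "Add source and build output"

# git rm --cached removes bin and obj from the index of every commit;
# --ignore-unmatch keeps it from failing on commits that lack them.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --index-filter \
    "git rm -r -q --cached --ignore-unmatch bin obj" --prune-empty HEAD

# List what the rewritten HEAD still tracks.
tracked=$(git ls-tree -r --name-only HEAD)
echo "$tracked"
```

Double quotes are used around the filter so the same line can be pasted on Windows, as noted in the other answers.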

1

For copypasters (from here):

git filter-repo --invert-paths --path PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA
echo "YOUR-FILE-WITH-SENSITIVE-DATA" >> .gitignore
git add .gitignore
git commit -m "Add YOUR-FILE-WITH-SENSITIVE-DATA to .gitignore"
git push origin --force --all

1 Comment

Would git push origin --tags also be a useful addition at the end?
0

I wanted to use the git-filter-repo, so I did:

  • cd into usr/local/bin
  • touch git-filter-repo (the file must not have an extension, and its name cannot be changed), then copy and paste the git-filter-repo code into it (you can use whatever command you like to download it into your PATH)
  • Cloned my target repo, because a fresh clone is what git-filter-repo works on. git clone --bare [email protected]:yourgit/yourepo.git
  • Then, cd yourepo.git
  • Run this command inside of that freshly cloned repo. git-filter-repo --path dirNameYouWantToDelete --invert-paths
  • git push origin master --force (if there is a better way I would like to know.)
  • Then I went back to my old repo, stashed all my changes and ran git pull.

After that, the history of my unwanted directory was gone and I was able to add the unwanted dir to .gitignore.

0

If you want to remove nested files:

git filter-branch -f --tree-filter "find . -name '*.png' -type f -delete" --prune-empty HEAD
