TIL - removing large objects from .git
clone a fresh copy from github
$ git clone --mirror <url>
detecting the large file
❯ dust --ignore-directory .venv/ --ignore-directory data/
8.0K ┌── README.md │█ │ 0%
8.0K ├── REFACTORING.md │█ │ 0%
8.0K │ ┌── augment.py │█ │ 0%
8.0K │ ├── core.py │█ │ 0%
8.0K │ ├── model.py │█ │ 0%
8.0K │ │ ┌── core.cpython-313.pyc │█ │ 0%
8.0K │ │ ├── model.cpython-313.pyc │█ │ 0%
12K │ │ ├── augment.cpython-313.pyc │█ │ 0%
36K │ ├─┴ __pycache__ │█ │ 0%
72K │ ┌─┴ deepaugment │█ │ 0%
72K ├─┴ src │█ │ 0%
248K ├── uv.lock │█ │ 0%
12K │ ┌── logs │█ │ 0%
16K │ ├── index │█ │ 0%
60K │ ├── hooks │█ │ 0%
76K │ │ ┌── pack-0542d6e993463b99ac69399306e79f7efc43457b.idx │█ │ 0%
80M │ │ ├── pack-0542d6e993463b99ac69399306e79f7efc43457b.pack│█ │ 99%
80M │ │ ┌─┴ pack │█ │ 99%
80M │ ├─┴ objects │█ │ 99%
80M ├─┴ .git │█ │ 100%
81M ┌─┴ . │█ │ 100%
The pack file .git/objects/pack/pack-0542d6e993463b99ac69399306e79f7efc43457b.pack is 80M!
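If dust is not installed, a rough equivalent can be put together from find, du, and sort. This is only a sketch: `largest_files` is a hypothetical helper name, sizes are reported in KiB, and unlike the dust call above it does not skip .venv/ or data/.

```shell
# Hypothetical helper: list the N largest regular files under a directory.
# Output format is "<size-in-KiB><TAB><path>", smallest first, largest last.
largest_files() {
  dir="${1:-.}"
  count="${2:-10}"
  # find -type f: regular files only (includes everything under .git)
  # du -k: disk usage in KiB, one line per file
  # sort -n | tail: keep the largest entries
  find "$dir" -type f -exec du -k {} + | sort -n | tail -n "$count"
}

# Show the 5 largest files below the current directory
largest_files . 5
```

In a repository like the one above, the .pack file would be expected to show up as the last line.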
find historic large file paths
find object hashes of largest 10 historic files
git verify-pack -v .git/objects/pack/pack-0542d6e993463b99ac69399306e79f7efc43457b.pack | sort -k3 -n | tail -10
for all hashes (i.e. object sha), find their historic paths
git rev-list --objects --all | grep -E "<object-sha-from-above>|<object-sha-from-above>|..."
save those to a new file. let's call it historic_large_file_paths.txt
Single command doing all together
$ git rev-list --objects --all | grep -E "$(git verify-pack -v .git/objects/pack/pack-0542d6e993463b99ac69399306e79f7efc43457b.pack | sort -k3 -n | tail -10 | awk '{print $1}' | tr '\n' '|' | sed 's/|$//')" | awk '{print $2}' > historic_large_file_paths.txt
explanation (progressively)
- git rev-list --objects --all: list all objects in the repository
- grep -E <pattern>: grep using an extended regular expression
- grep -E "<option-1>|<option-2>|...": match lines containing any of several alternatives
- "$(<any-command>)": run any command and use its output as a string
- grep -E "$(<any-command>)": run any command and grep based on its output (see the long command above)
- git verify-pack -v .git/objects/pack/<pack-file>.pack: verify pack file in verbose mode, showing objects by sha and their sizes
- sort -k3 -n: sort by third column (-k3) in numeric order (-n). in our example, third column is the size of the object
- tail -10: show last 10 lines
- awk '{print $1}': pick the first column. in our above example, first column is the sha of objects
- tr '\n' '|': replace newlines with pipe symbol
- sed 's/|$//': remove trailing pipe symbol
- awk '{print $2}': pick the second column. in our above example, second column is the path of objects
- > historic_large_file_paths.txt: save to file
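To see how the pieces fit together without touching a real repository, the same pipeline can be run on made-up input. The shas, sizes, and paths below are invented purely for illustration (real shas are 40 hex characters), and `sample_pack_listing` / `sample_object_listing` are hypothetical stand-ins for `git verify-pack -v` and `git rev-list --objects --all`.

```shell
# Made-up lines in the shape of `git verify-pack -v` output:
# <sha> <type> <size> <size-in-pack> <offset>
sample_pack_listing() {
  cat <<'EOF'
aaaaaaa blob 120 90 12
bbbbbbb blob 52428800 52000000 200
ccccccc tree 64 60 52000300
ddddddd blob 31457280 31000000 52000400
EOF
}

# Made-up lines in the shape of `git rev-list --objects --all` output:
# <sha> <path>
sample_object_listing() {
  cat <<'EOF'
aaaaaaa README.md
bbbbbbb data/train.bin
ccccccc src
ddddddd models/checkpoint.pt
EOF
}

# Step 1: turn the shas of the 2 largest objects into an "a|b" regex
pattern="$(sample_pack_listing | sort -k3 -n | tail -2 | awk '{print $1}' | tr '\n' '|' | sed 's/|$//')"
echo "$pattern"   # prints: ddddddd|bbbbbbb

# Step 2: keep only the matching objects and print their paths
sample_object_listing | grep -E "$pattern" | awk '{print $2}'
# prints:
# data/train.bin
# models/checkpoint.pt
```

Note that awk '{print $2}' splits on whitespace, so a path containing spaces would be cut off at the first space.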
remove selected files from git history
$ brew install git-filter-repo
$ git filter-repo --paths-from-file historic_large_file_paths.txt --invert-paths
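Since rewriting history is destructive, it may be worth reviewing the path list before running filter-repo. Several large versions of the same file each contribute the same path, so the list can contain duplicates. A minimal sketch, assuming the file name used above; `clean_path_list` and `cleaned_paths.txt` are hypothetical names:

```shell
# Hypothetical helper: print the unique, non-empty paths from a list file.
clean_path_list() {
  # grep -v '^$' drops blank lines; sort -u removes duplicate paths
  grep -v '^$' "$1" | sort -u
}

# Only run when the list from the earlier step exists
if [ -f historic_large_file_paths.txt ]; then
  clean_path_list historic_large_file_paths.txt > cleaned_paths.txt
  cat cleaned_paths.txt   # eyeball the paths that would be removed
fi
```

The cleaned file could then be passed to --paths-from-file in place of the original list.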