
"juicefs gc --delete" fall into infinite loop #5335

Open
frostwind opened this issue Dec 2, 2024 · 0 comments
Labels
kind/bug Something isn't working

Comments


frostwind commented Dec 2, 2024

What happened:
When running gc with the "--delete" option, e.g.
"juicefs gc postgres://jfs_admin:'xxxx'@jfs_meta_url:5432/jfs --delete"
it falls into an infinite loop, as shown below.

2024/12/02 08:35:24.920074 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920132 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920480 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920999 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]

What you expected to happen:
It should stop, or skip the non-existent inode.
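A minimal sketch (hypothetical, not the actual juicefs code) of the termination behavior I would expect: when resolving a directory's parent chain, a failed parent lookup should end the walk instead of retrying the same missing inode forever.

```python
# Hypothetical sketch, NOT juicefs source: walking the parent chain of an
# inode should stop when a parent lookup fails ("no such file or directory"),
# rather than retrying the same missing inode in a loop.

ROOT_INODE = 1

def walk_to_root(get_parent, inode, max_depth=4096):
    """Return the chain of ancestors of `inode`, stopping on a missing parent."""
    chain = []
    seen = set()
    while inode != ROOT_INODE:
        if inode in seen or len(chain) >= max_depth:
            raise RuntimeError(f"parent loop detected at inode {inode}")
        seen.add(inode)
        parent = get_parent(inode)  # returns None when the inode row is gone
        if parent is None:
            # Skip the broken subtree instead of looping on the warning.
            break
        chain.append(parent)
        inode = parent
    return chain

# Toy metadata mirroring the issue: 25152661 -> 11496018,
# but 11496018 itself has no row.
parents = {25152661: 11496018}
print(walk_to_root(parents.get, 25152661))  # -> [11496018]
```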

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?
I am using the mount options below for most of my tests:
juicefs mount -d -o allow_other --writeback --backup-meta 0 --buffer-size 2000 --cache-partial-only

jfs=# select * from jfs_node where inode=11496018;
 inode | type | flags | mode | uid | gid | atime | mtime | ctime | atimensec | mtimensec | ctimensec | nlink | length | rdev | parent | access_acl_id | default_acl_id 
-------+------+-------+------+-----+-----+-------+-------+-------+-----------+-----------+-----------+-------+--------+------+--------+---------------+----------------
(0 rows)

jfs=# select * from jfs_node where parent=11496018;
  inode   | type | flags | mode | uid | gid |      atime       |      mtime       |      ctime       | atimensec | mtimensec | ctimensec | nlink | length | rdev |  parent  | access_acl_id | default_acl_id 
----------+------+-------+------+-----+-----+------------------+------------------+------------------+-----------+-----------+-----------+-------+--------+------+----------+---------------+----------------
 25152661 |    2 |     0 |  493 |   0 |   0 | 1728496776274948 | 1728496776274948 | 1728496776274948 |       943 |       943 |       943 |   249 |   4096 |    0 | 11496018 |             0 |              0
(1 row)
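For reference, a single query can enumerate every such "dangling parent" at once, instead of checking inodes one by one. This is a sketch verified against an in-memory SQLite stand-in with a minimal two-column version of jfs_node; the real metadata engine here is PostgreSQL, where the same LEFT JOIN applies.

```python
# Minimal SQLite stand-in for the jfs_node checks above: find every node
# whose parent inode has no row of its own (the dangling-parent case).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jfs_node (inode INTEGER PRIMARY KEY, parent INTEGER)")
# 25152661 points at parent 11496018, which has no row -- same shape as the issue.
conn.executemany(
    "INSERT INTO jfs_node VALUES (?, ?)",
    [(1, 0), (25152661, 11496018), (25358790, 25152661)],
)

orphans = conn.execute("""
    SELECT c.inode, c.parent
    FROM jfs_node AS c
    LEFT JOIN jfs_node AS p ON p.inode = c.parent
    WHERE c.parent != 0 AND p.inode IS NULL
""").fetchall()
print(orphans)  # -> [(25152661, 11496018)]
```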

[root@xxx jfs]# juicefs info -i 11496018
2024/12/02 08:40:42.871177 juicefs[18913] <FATAL>: info: no such file or directory [info.go:152]

[root@xxx jfs]# juicefs info -i 25152661
25152661 :
  inode: 25152661
  files: 0
   dirs: 19
 length: 1.75 MiB (1840395 Bytes)
   size: 2.73 MiB (2863104 Bytes)
   path: unknown

Create table broken_records as 
WITH RECURSIVE c AS (
   SELECT 11496018::bigint AS inode , 0::bigint as parent 
   UNION ALL
   SELECT sa.inode , sa.parent 
   FROM jfs_node AS sa
      JOIN c ON c.inode = sa.parent
)
 SELECT *  FROM c;

SELECT 1212423

I used the SQL above to dump the broken directory structure; it returned about 1.2M records.
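The recursive CTE above is equivalent to a plain breadth-first walk over (inode, parent) pairs; a small sketch with toy data (the row values below are made up for illustration):

```python
# BFS equivalent of the recursive CTE above: collect every descendant of a
# starting inode by repeatedly expanding children via the parent column.
from collections import defaultdict, deque

def descendants(edges, root):
    """edges: iterable of (inode, parent) rows; returns all inodes under root."""
    children = defaultdict(list)
    for inode, parent in edges:
        children[parent].append(inode)
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            found.append(child)
            queue.append(child)
    return found

# Toy rows: inodes 2 and 3 under 11496018, inode 4 under 2.
rows = [(2, 11496018), (3, 11496018), (4, 2)]
print(descendants(rows, 11496018))  # -> [2, 3, 4]
```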

Judging from the timestamps (e.g. "mtime" = 1728496776274948, which is Oct 9 2024), these records seem to belong to directories created by "juicefs clone". For example, inode 25358790 below is a directory with 100 files under it, and 25358800 is one of those files; this size matches how I created the test directories, each with 100 empty files. During my test, I created a directory "dir1" containing about 20 million files in total, with many subdirectories per layer and 100 direct empty files under each subdirectory. After creating "dir1", I used "juicefs clone" to clone dir1 to dir2. My best guess is that this broken inode is somehow related to the cloned directory.
I also tried "juicefs fsck --path / --repair --recursive", but it does not seem to fix the issue.

jfs=# select count(*) from broken_records where parent=25358790;
 count 
-------
   100
(1 row)

jfs=# select parent,inode from broken_records where parent=25358790 limit 1;
  parent  |  inode   
----------+----------
 25358790 | 25358800
(1 row)


[root@xxx jfs]# juicefs info -i 25358790
25358790 :
  inode: 25358790
  files: 100
   dirs: 1
 length: 200 Bytes
   size: 404.00 KiB (413696 Bytes)
   path: unknown
[root@xxx jfs]# juicefs info -i  25358800
25358800 :
  inode: 25358800
  files: 1
   dirs: 0
 length: 2 Bytes
   size: 4.00 KiB (4096 Bytes)
   path: unknown
 objects:
+------------+---------------------------------+------+--------+--------+
| chunkIndex |            objectName           | size | offset | length |
+------------+---------------------------------+------+--------+--------+
|          0 | myjfs/chunks/6/6875/6875461_0_2 |    2 |      0 |      2 |
+------------+---------------------------------+------+--------+--------+

### Further check of the child-count distribution of the broken directories: most of them (11,998) have 100 direct files/dirs each, closely matching how I generated "dir1" with about 20 million files.
jfs=# select count(*),child_num from (select count(*) as child_num ,parent from broken_records group by parent order by count(*)) t group by child_num;
 count | child_num 
-------+-----------
     3 |         1
     1 |         7
     1 |        10
     1 |        13
     1 |        15
     1 |        18
     1 |        22
     1 |        23
     1 |        26
     2 |        35
     2 |        59
     1 |        60
     1 |        63
     1 |        68
     1 |        93
 11998 |       100
     2 |       603
     5 |       604
     1 |       605
     2 |       606
     1 |       607
     1 |       665
     1 |       672
     1 |       673
     1 |       675
     1 |       679
     2 |      1000
(27 rows)
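The nested GROUP BY above can be mirrored in plain Python with two Counter passes, which may be easier to sanity-check against small samples (the (parent, inode) pairs below are toy data, not real records):

```python
# The nested GROUP BY above, mirrored in Python: first count children per
# parent, then count how many parents share each child count.
from collections import Counter

# Toy (parent, inode) pairs standing in for broken_records.
records = [(10, 100), (10, 101), (10, 102), (20, 200), (20, 201), (30, 300)]

children_per_parent = Counter(parent for parent, _ in records)  # {10: 3, 20: 2, 30: 1}
distribution = Counter(children_per_parent.values())            # {3: 1, 2: 1, 1: 1}
print(sorted(distribution.items()))  # -> [(1, 1), (2, 1), (3, 1)]
```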

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.2.1+2024-08-30.cd871d19

  • Cloud provider or hardware configuration running JuiceFS: on-prem hardware with ceph storage backend.

  • OS (e.g cat /etc/os-release): CentOS Linux release 7.9.2009 (Core)

  • Kernel (e.g. uname -a): 5.4.206-200.el7.x86_64 #1 SMP Thu Jul 28 14:58:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Object storage (cloud provider and region, or self maintained): Ceph , self hosted

  • Metadata engine info (version, cloud provider managed or self maintained): PostgreSQL 17.2

  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): all local network in the same datacenter.

  • Others:
