
"juicefs gc --delete" fall into infinite loop #5335

Open
frostwind opened this issue Dec 2, 2024 · 0 comments
Labels
kind/bug Something isn't working

Comments


frostwind commented Dec 2, 2024

What happened:
When running gc with the "--delete" option, e.g.
"juicefs gc postgres://jfs_admin:'xxxx'@jfs_meta_url:5432/jfs --delete"
it falls into an infinite loop, as shown below.

2024/12/02 08:35:24.920074 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920132 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920480 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]
2024/12/02 08:35:24.920999 juicefs[18479] <WARNING>: Get directory parent of inode 11496018: no such file or directory [quota.go:347]

What you expected to happen:
It should stop, or skip the non-existent inode.
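A minimal sketch (hypothetical, not the actual juicefs code) of the termination behavior I would expect: when resolving a directory's parent chain, a failed parent lookup should end the walk instead of retrying the same missing inode forever.

```python
# Hypothetical sketch, NOT juicefs source: walking the parent chain of an
# inode should stop when a parent lookup fails ("no such file or directory"),
# rather than retrying the same missing inode in a loop.

ROOT_INODE = 1

def walk_to_root(get_parent, inode, max_depth=4096):
    """Return the chain of ancestors of `inode`, stopping on a missing parent."""
    chain = []
    seen = set()
    while inode != ROOT_INODE:
        if inode in seen or len(chain) >= max_depth:
            raise RuntimeError(f"parent loop detected at inode {inode}")
        seen.add(inode)
        parent = get_parent(inode)  # returns None when the inode row is gone
        if parent is None:
            # Skip the broken subtree instead of looping on the warning.
            break
        chain.append(parent)
        inode = parent
    return chain

# Toy metadata mirroring the issue: 25152661 -> 11496018,
# but 11496018 itself has no row.
parents = {25152661: 11496018}
print(walk_to_root(parents.get, 25152661))  # -> [11496018]
```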

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?
I am using the mount options below for most of my tests:
juicefs mount -d -o allow_other --writeback --backup-meta 0 --buffer-size 2000 --cache-partial-only

jfs=# select * from jfs_node where inode=11496018;
 inode | type | flags | mode | uid | gid | atime | mtime | ctime | atimensec | mtimensec | ctimensec | nlink | length | rdev | parent | access_acl_id | default_acl_id 
-------+------+-------+------+-----+-----+-------+-------+-------+-----------+-----------+-----------+-------+--------+------+--------+---------------+----------------
(0 rows)

jfs=# select * from jfs_node where parent=11496018;
  inode   | type | flags | mode | uid | gid |      atime       |      mtime       |      ctime       | atimensec | mtimensec | ctimensec | nlink | length | rdev |  parent  | access_acl_id | default_acl_id 
----------+------+-------+------+-----+-----+------------------+------------------+------------------+-----------+-----------+-----------+-------+--------+------+----------+---------------+----------------
 25152661 |    2 |     0 |  493 |   0 |   0 | 1728496776274948 | 1728496776274948 | 1728496776274948 |       943 |       943 |       943 |   249 |   4096 |    0 | 11496018 |             0 |              0
(1 row)
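For reference, a single query can enumerate every such "dangling parent" at once, instead of checking inodes one by one. This is a sketch verified against an in-memory SQLite stand-in with a minimal two-column version of jfs_node; the real metadata engine here is PostgreSQL, where the same LEFT JOIN applies.

```python
# Minimal SQLite stand-in for the jfs_node checks above: find every node
# whose parent inode has no row of its own (the dangling-parent case).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jfs_node (inode INTEGER PRIMARY KEY, parent INTEGER)")
# 25152661 points at parent 11496018, which has no row -- same shape as the issue.
conn.executemany(
    "INSERT INTO jfs_node VALUES (?, ?)",
    [(1, 0), (25152661, 11496018), (25358790, 25152661)],
)

orphans = conn.execute("""
    SELECT c.inode, c.parent
    FROM jfs_node AS c
    LEFT JOIN jfs_node AS p ON p.inode = c.parent
    WHERE c.parent != 0 AND p.inode IS NULL
""").fetchall()
print(orphans)  # -> [(25152661, 11496018)]
```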

[root@xxx jfs]# juicefs info -i 11496018
2024/12/02 08:40:42.871177 juicefs[18913] <FATAL>: info: no such file or directory [info.go:152]

[root@xxx jfs]# juicefs info -i 25152661
25152661 :
  inode: 25152661
  files: 0
   dirs: 19
 length: 1.75 MiB (1840395 Bytes)
   size: 2.73 MiB (2863104 Bytes)
   path: unknown

Create table broken_records as 
WITH RECURSIVE c AS (
   SELECT 11496018::bigint AS inode , 0::bigint as parent 
   UNION ALL
   SELECT sa.inode , sa.parent 
   FROM jfs_node AS sa
      JOIN c ON c.inode = sa.parent
)
 SELECT *  FROM c;

SELECT 1212423

I used the SQL above to dump the broken directory structure; it returned about 1.2M records.
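The recursive CTE above is equivalent to a plain breadth-first walk over (inode, parent) pairs; a small sketch with toy data (the row values below are made up for illustration):

```python
# BFS equivalent of the recursive CTE above: collect every descendant of a
# starting inode by repeatedly expanding children via the parent column.
from collections import defaultdict, deque

def descendants(edges, root):
    """edges: iterable of (inode, parent) rows; returns all inodes under root."""
    children = defaultdict(list)
    for inode, parent in edges:
        children[parent].append(inode)
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            found.append(child)
            queue.append(child)
    return found

# Toy rows: inodes 2 and 3 under 11496018, inode 4 under 2.
rows = [(2, 11496018), (3, 11496018), (4, 2)]
print(descendants(rows, 11496018))  # -> [2, 3, 4]
```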

Judging from the timestamps (e.g. "mtime" = 1728496776274948, which is Oct 9 2024), these records seem to belong to directories created by "juicefs clone". For example, inode 25358790 below is a directory with 100 files under it, and 25358800 is one of those files; this size matches how I created the test directories, each with 100 empty files. During my test, I created a directory "dir1" containing about 20 million files in total, with many subdirectories per layer and 100 direct empty files under each subdirectory. After creating "dir1", I used "juicefs clone" to clone dir1 to dir2. My best guess is that this broken inode is somehow related to the cloned directory.
I also tried "juicefs fsck --path / --repair --recursive", but it does not seem to fix the issue.

jfs=# select count(*) from broken_records where parent=25358790;
 count 
-------
   100
(1 row)

jfs=# select parent,inode from broken_records where parent=25358790 limit 1;
  parent  |  inode   
----------+----------
 25358790 | 25358800
(1 row)


[root@xxx jfs]# juicefs info -i 25358790
25358790 :
  inode: 25358790
  files: 100
   dirs: 1
 length: 200 Bytes
   size: 404.00 KiB (413696 Bytes)
   path: unknown
[root@xxx jfs]# juicefs info -i  25358800
25358800 :
  inode: 25358800
  files: 1
   dirs: 0
 length: 2 Bytes
   size: 4.00 KiB (4096 Bytes)
   path: unknown
 objects:
+------------+---------------------------------+------+--------+--------+
| chunkIndex |            objectName           | size | offset | length |
+------------+---------------------------------+------+--------+--------+
|          0 | myjfs/chunks/6/6875/6875461_0_2 |    2 |      0 |      2 |
+------------+---------------------------------+------+--------+--------+

### Further check of the child-count distribution of the broken directories: most of them (11,998) have 100 direct files/dirs each, closely matching how I generated "dir1" with about 20 million files.
jfs=# select count(*),child_num from (select count(*) as child_num ,parent from broken_records group by parent order by count(*)) t group by child_num;
 count | child_num 
-------+-----------
     3 |         1
     1 |         7
     1 |        10
     1 |        13
     1 |        15
     1 |        18
     1 |        22
     1 |        23
     1 |        26
     2 |        35
     2 |        59
     1 |        60
     1 |        63
     1 |        68
     1 |        93
 11998 |       100
     2 |       603
     5 |       604
     1 |       605
     2 |       606
     1 |       607
     1 |       665
     1 |       672
     1 |       673
     1 |       675
     1 |       679
     2 |      1000
(27 rows)
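The nested GROUP BY above can be mirrored in plain Python with two Counter passes, which may be easier to sanity-check against small samples (the (parent, inode) pairs below are toy data, not real records):

```python
# The nested GROUP BY above, mirrored in Python: first count children per
# parent, then count how many parents share each child count.
from collections import Counter

# Toy (parent, inode) pairs standing in for broken_records.
records = [(10, 100), (10, 101), (10, 102), (20, 200), (20, 201), (30, 300)]

children_per_parent = Counter(parent for parent, _ in records)  # {10: 3, 20: 2, 30: 1}
distribution = Counter(children_per_parent.values())            # {3: 1, 2: 1, 1: 1}
print(sorted(distribution.items()))  # -> [(1, 1), (2, 1), (3, 1)]
```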

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.2.1+2024-08-30.cd871d19

  • Cloud provider or hardware configuration running JuiceFS: on-prem hardware with ceph storage backend.

  • OS (e.g cat /etc/os-release): CentOS Linux release 7.9.2009 (Core)

  • Kernel (e.g. uname -a): 5.4.206-200.el7.x86_64 #1 SMP Thu Jul 28 14:58:01 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Object storage (cloud provider and region, or self maintained): Ceph , self hosted

  • Metadata engine info (version, cloud provider managed or self maintained): PostgreSQL 17.2

  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): all local network in the same datacenter.

  • Others:
