Fix sql failed when using replace function #9524

Draft
wants to merge 14 commits into
base: master

Conversation

EricZequan
Contributor

@EricZequan EricZequan commented Oct 14, 2024

What problem does this PR solve?

Issue Number: ref #9522

Problem Summary:

What is changed and how it works?

TiFlash only supports the replace function when its first parameter is of type ColumnString.

So I added a type check on col_src in the replace execution function and convert col_src into a ColumnString that can be parsed correctly.

I also added the following functions to handle the corresponding Const cases:

  • vectorConstSrcAndReplace - handles replace(Const, Column, Const)
  • vectorConstSrcAndNeedle - handles replace(Const, Const, Column)
  • vectorConstSrc - handles replace(Const, Column, Column)
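
As an illustration only, the replace(Const, Column, Column) case can be sketched outside TiFlash with plain std::string and std::vector. The names replaceAllOccurrences and replaceConstSrc are hypothetical and do not mirror the PR's actual signatures; the sketch only shows one constant source being paired with a per-row needle and replacement. Treating an empty needle as "leave the source unchanged" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replace every occurrence of `needle` in `src`.
// Assumption of this sketch: an empty needle leaves the source unchanged.
std::string replaceAllOccurrences(
    const std::string & src,
    const std::string & needle,
    const std::string & replacement)
{
    if (needle.empty())
        return src;
    std::string out;
    std::size_t pos = 0;
    while (true)
    {
        const std::size_t hit = src.find(needle, pos);
        if (hit == std::string::npos)
        {
            out.append(src, pos, std::string::npos); // copy the tail
            return out;
        }
        out.append(src, pos, hit - pos); // copy the text before the match
        out += replacement;
        pos = hit + needle.size();
    }
}

// Hypothetical stand-in for replace(Const, Column, Column):
// one source value shared by all rows, per-row needle and replacement.
std::vector<std::string> replaceConstSrc(
    const std::string & const_src,
    const std::vector<std::string> & needles,
    const std::vector<std::string> & replacements)
{
    assert(needles.size() == replacements.size());
    std::vector<std::string> result;
    result.reserve(needles.size());
    for (std::size_t i = 0; i < needles.size(); ++i)
        result.push_back(replaceAllOccurrences(const_src, needles[i], replacements[i]));
    return result;
}
```

The result always has as many rows as the non-constant arguments, which is the property the Const cases need to preserve.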

The following are the test results of the relevant sql:

mysql> CREATE TABLE `tl7154cbef` (
    ->   `col_11` vector(1) NOT NULL,
    ->   `col_12` int(10) unsigned NOT NULL,
    ->   `col_13` vector NOT NULL,
    ->   `col_14` vector(3) NOT NULL,
    ->   VECTOR INDEX `idx_38_6` USING HNSW ((VEC_L2_DISTANCE(`col_11`)))
    -> ) ENGINE=InnoDB DEFAULT CHARSET=gbk COLLATE=gbk_chinese_ci;
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO `tl7154cbef` VALUES ('[0.902107]', 2011164482, '[0.828307,0.319142,0.696748,0.133167]', '[0.405775,0.866348,0.373082]');
Query OK, 1 row affected (0.01 sec)

mysql> WITH cte_331 (col_1805, col_1806, col_1807, col_1808, col_1809) AS (
    ->     SELECT 
    ->         /*+ read_from_storage(tiflash[tl7154cbef]) */ 
    ->         /*+ agg_to_cop() hash_agg() */ 
    ->         BIT_XOR(tl7154cbef.col_12) AS r0, 
    ->         GROUP_CONCAT(tl7154cbef.col_12 ORDER BY tl7154cbef.col_12) AS r1, 
    ->         CHARACTER_LENGTH(tl7154cbef.col_11) AS r2, 
    ->         COUNT(DISTINCT tl7154cbef.col_12) AS r3, 
    ->         REPLACE(tl7154cbef.col_13, tl7154cbef.col_11, tl7154cbef.col_13) AS r4 
    ->     FROM 
    ->         tl7154cbef 
    ->     WHERE 
    ->         tl7154cbef.col_13 IN ('[0.471552, 0.743432, 0.333821, 0.950423, 0.134729, 0.474538, 0.643419, 0.625898, 0.269346]') 
    ->     GROUP BY 
    ->         tl7154cbef.col_13, tl7154cbef.col_11 
    ->     ORDER BY 
    ->         r0, r1, r2, r3, r4 
    -> )
    -> SELECT 
    ->     1, col_1805, col_1806, col_1807, col_1808, col_1809 
    -> FROM 
    ->     cte_331 
    -> WHERE 
    ->     ISNULL(cte_331.col_1809) 
    -> ORDER BY 
    ->     1, 2, 3, 4, 5, 6 
    -> LIMIT 
    ->     21180486;
Empty set, 1 warning (0.10 sec)

mysql> explain WITH cte_331 (col_1805, col_1806, col_1807, col_1808, col_1809) AS (     SELECT          /*+ read_from_storage(tiflash[tl7154cbef]) */          /*+ agg_to_cop() hash_agg() */          BIT_XOR(tl7154cbef.col_12) AS r0,          GROUP_CONCAT(tl7154cbef.col_12 ORDER BY tl7154cbef.col_12) AS r1,          CHARACTER_LENGTH(tl7154cbef.col_11) AS r2,          COUNT(DISTINCT tl7154cbef.col_12) AS r3,          REPLACE(tl7154cbef.col_13, tl7154cbef.col_11, tl7154cbef.col_13) AS r4      FROM          tl7154cbef      WHERE          tl7154cbef.col_13 IN ('[0.471552, 0.743432, 0.333821, 0.950423, 0.134729, 0.474538, 0.643419, 0.625898, 0.269346]')      GROUP BY          tl7154cbef.col_13, tl7154cbef.col_11      ORDER BY          r0, r1, r2, r3, r4  ) SELECT      1, col_1805, col_1806, col_1807, col_1808, col_1809  FROM      cte_331  WHERE      ISNULL(cte_331.col_1809)  ORDER BY      1, 2, 3, 4, 5, 6  LIMIT      21180486;
+------------------------------------------+---------+--------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                       | estRows | task         | access object    | operator info                                                                                                                                                                                                                                                                                                        |
+------------------------------------------+---------+--------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection_15                            | 6.40    | root         |                  | 1->Column#26, Column#21, Column#22, Column#24, Column#23, Column#25                                                                                                                                                                                                                                                  |
| └─Projection_16                          | 6.40    | root         |                  | Column#21, Column#22, character_length(cast(test.tl7154cbef.col_11, var_string(1)))->Column#24, Column#23, replace(cast(test.tl7154cbef.col_13, var_string(5)), cast(test.tl7154cbef.col_11, var_string(1)), cast(test.tl7154cbef.col_13, var_string(5)))->Column#25                                                 |
|   └─Projection_27                        | 6.40    | root         |                  | Column#21, Column#22, Column#23, test.tl7154cbef.col_11, test.tl7154cbef.col_13                                                                                                                                                                                                                                      |
|     └─TopN_19                            | 6.40    | root         |                  | Column#21, Column#22, Column#35, Column#23, Column#36, offset:0, count:21180486                                                                                                                                                                                                                                      |
|       └─Projection_28                    | 6.40    | root         |                  | Column#21, Column#22, Column#23, test.tl7154cbef.col_11, test.tl7154cbef.col_13, character_length(cast(test.tl7154cbef.col_11, var_string(1)))->Column#35, replace(cast(test.tl7154cbef.col_13, var_string(5)), cast(test.tl7154cbef.col_11, var_string(1)), cast(test.tl7154cbef.col_13, var_string(5)))->Column#36 |
|         └─HashAgg_21                     | 6.40    | root         |                  | group by:Column#33, Column#34, funcs:bit_xor(Column#30)->Column#21, funcs:group_concat(Column#31 order by Column#30 separator ",")->Column#22, funcs:count(distinct Column#32)->Column#23, funcs:firstrow(Column#33)->test.tl7154cbef.col_11, funcs:firstrow(Column#34)->test.tl7154cbef.col_13                      |
|           └─Projection_26                | 8.00    | root         |                  | test.tl7154cbef.col_12->Column#30, cast(test.tl7154cbef.col_12, var_string(20))->Column#31, test.tl7154cbef.col_12->Column#32, test.tl7154cbef.col_11->Column#33, test.tl7154cbef.col_13->Column#34                                                                                                                  |
|             └─TableReader_25             | 8.00    | root         |                  | MppVersion: 2, data:ExchangeSender_24                                                                                                                                                                                                                                                                                |
|               └─ExchangeSender_24        | 8.00    | mpp[tiflash] |                  | ExchangeType: PassThrough                                                                                                                                                                                                                                                                                            |
|                 └─Selection_23           | 8.00    | mpp[tiflash] |                  | isnull(replace("[0.471552,0.743432,0.333821,0.950423,0.134729,0.474538,0.643419,0.625898,0.269346]", cast(test.tl7154cbef.col_11, var_string(1)), "[0.471552,0.743432,0.333821,0.950423,0.134729,0.474538,0.643419,0.625898,0.269346]"))                                                                             |
|                   └─TableFullScan_22     | 10.00   | mpp[tiflash] | table:tl7154cbef | pushed down filter:eq(test.tl7154cbef.col_13, [0.471552,0.743432,0.333821,0.950423,0.134729,0.474538,0.643419,0.625898,0.269346]), keep order:false, stats:pseudo                                                                                                                                                    |
+------------------------------------------+---------+--------------+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set, 3 warnings (0.01 sec)

Other cases:

mysql> CREATE TABLE my_table (
    ->          `col_11` VARCHAR(255) NOT NULL,
    ->          `col_12` int(10) unsigned NOT NULL,
    ->          `col_13` VARCHAR(255) NOT NULL,
    ->          `col_14` VARCHAR(255) NOT NULL
    ->      );
Query OK, 0 rows affected (0.04 sec)

mysql> insert into my_table values ('World',1203945,'Hello, World!','FOR TEST');
Query OK, 1 row affected (0.00 sec)

mysql> ALTER TABLE my_table SET TIFLASH REPLICA 1;
Query OK, 0 rows affected (0.04 sec)

mysql> SELECT /*+ read_from_storage(tiflash[my_table]) */ REPLACE('Hello World', my_table.col_11, my_table.col_14)  FROM my_table;
+----------------------------------------------------------+
| REPLACE('Hello World', my_table.col_11, my_table.col_14) |
+----------------------------------------------------------+
| Hello FOR TEST                                            |
+----------------------------------------------------------+
1 row in set (0.09 sec)

mysql> explain SELECT /*+ read_from_storage(tiflash[my_table]) */ REPLACE('Hello World', my_table.col_11, my_table.col_14)  FROM my_table;
+---------------------------+----------+--------------+----------------+----------------------------------------------------------------------------+
| id                        | estRows  | task         | access object  | operator info                                                              |
+---------------------------+----------+--------------+----------------+----------------------------------------------------------------------------+
| TableReader_10            | 10000.00 | root         |                | MppVersion: 2, data:ExchangeSender_9                                       |
| └─ExchangeSender_9        | 10000.00 | mpp[tiflash] |                | ExchangeType: PassThrough                                                  |
|   └─Projection_4          | 10000.00 | mpp[tiflash] |                | replace(Hello World, test.my_table.col_11, test.my_table.col_14)->Column#6 |
|     └─TableFullScan_8     | 10000.00 | mpp[tiflash] | table:my_table | keep order:false, stats:pseudo                                             |
+---------------------------+----------+--------------+----------------+----------------------------------------------------------------------------+
4 rows in set (0.00 sec)

mysql> SELECT /*+ read_from_storage(tiflash[my_table]) */ REPLACE('Hello World', my_table.col_11, 'forMYtest')  FROM my_table;
+------------------------------------------------------+
| REPLACE('Hello World', my_table.col_11, 'forMYtest') |
+------------------------------------------------------+
| Hello forMYtest                                       |
+------------------------------------------------------+
1 row in set (0.07 sec)

mysql> explain SELECT /*+ read_from_storage(tiflash[my_table]) */ REPLACE('Hello World', my_table.col_11, 'forMYtest')  FROM my_table;
+---------------------------+----------+--------------+----------------+-----------------------------------------------------------------+
| id                        | estRows  | task         | access object  | operator info                                                   |
+---------------------------+----------+--------------+----------------+-----------------------------------------------------------------+
| TableReader_10            | 10000.00 | root         |                | MppVersion: 2, data:ExchangeSender_9                            |
| └─ExchangeSender_9        | 10000.00 | mpp[tiflash] |                | ExchangeType: PassThrough                                       |
|   └─Projection_4          | 10000.00 | mpp[tiflash] |                | replace(Hello World, test.my_table.col_11, forMYtest)->Column#6 |
|     └─TableFullScan_8     | 10000.00 | mpp[tiflash] | table:my_table | keep order:false, stats:pseudo                                  |
+---------------------------+----------+--------------+----------------+-----------------------------------------------------------------+
4 rows in set (0.00 sec)



Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: “EricZequan” <[email protected]>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. labels Oct 14, 2024
Contributor

ti-chi-bot bot commented Oct 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign calvinneo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-linked-issue labels Oct 14, 2024
@EricZequan
Contributor Author

/cc @breezewish

@ti-chi-bot ti-chi-bot bot requested a review from breezewish October 14, 2024 04:34
@@ -1063,6 +1063,166 @@ struct ReplaceStringImpl
}
}

// Handle the case where `column_src` and `replace` are const
static void vectorConstReplacement(
Member

@breezewish breezewish Oct 14, 2024

Can we modify the existing functions instead of introducing new ones? The existing functions can already handle a list of sources. It would be better to let them handle a constant source (i.e. one source), minimizing the changes.

Contributor Author

I think these functions are necessary because the existing functions assume the first parameter is a Column, which is parsed into a ColumnString in FunctionsStringSearch.cpp, whereas our implementation needs to parse the parameter as a ColumnConst. ColumnConst does not have getChars and getOffsets methods, so the existing functions cannot be used.
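
For context, here is a minimal sketch of the chars-plus-offsets layout that getChars and getOffsets expose. It assumes the ClickHouse-style ColumnString convention, in which each row is stored with a trailing zero byte and offsets[i] points one past it. FlatStringColumn is a hypothetical stand-in, not the TiFlash class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a string column: all rows are concatenated into
// `chars`, each followed by '\0'; offsets[i] is one past row i's terminator.
struct FlatStringColumn
{
    std::vector<char> chars;
    std::vector<std::size_t> offsets;

    void insert(const std::string & s)
    {
        chars.insert(chars.end(), s.begin(), s.end());
        chars.push_back('\0');
        offsets.push_back(chars.size());
    }

    std::size_t size() const { return offsets.size(); }

    // Read row i back by slicing between the previous offset and this one.
    std::string at(std::size_t i) const
    {
        assert(i < size());
        const std::size_t begin = (i == 0) ? 0 : offsets[i - 1];
        const std::size_t len = offsets[i] - begin - 1; // exclude the terminator
        return std::string(chars.data() + begin, len);
    }
};
```

A constant holds a single value plus a row count, so it has no per-row offsets to hand to code written against this layout.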

Member

@breezewish breezewish Oct 14, 2024

It doesn't make sense. Think about it: a function can now calculate a batch of FN(A, B), so why can't it calculate a single row of FN(ConstA, ConstB)? What we want to support is a subset of the current capability.

Contributor Author

Yes, I found a way to convert between the two types, so that now the existing function can be used directly, and the amount of code is greatly reduced.

Comment on lines 130 to 157
    if (!needle_const && replacement_const)
    {
        executeImplConstReplacement(
            column_src,
            column_needle,
            column_replacement,
            pos,
            occ,
            match_type,
            column_result);
    }
    else if (!needle_const && !replacement_const)
    {
        executeImplConstFirstParaReplacement(
            column_src,
            column_needle,
            column_replacement,
            pos,
            occ,
            match_type,
            column_result);
    }
    else
    {
        throw Exception(
            "UnImplement function.",
            ErrorCodes::BAD_ARGUMENTS);
    }
Member

There is possibly no need for a lot of impls, because we are fixing an edge case that was only discovered in tests. It is acceptable for it not to be performance-optimal.

Contributor Author

In my test cases, some SQL such as:

explain SELECT /*+ read_from_storage(tiflash[my_table]) */ REPLACE('Hello World', my_table.col_11, my_table.col_14)  FROM my_table;

hits the same error. I think it needs to be handled, so I added these functions to deal with such problems. 😂

Member

@breezewish breezewish Oct 14, 2024

Yes. I mean we could use a slower but more general impl for such cases.

Signed-off-by: “EricZequan” <[email protected]>
Signed-off-by: “EricZequan” <[email protected]>
Signed-off-by: “EricZequan” <[email protected]>
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 14, 2024
Signed-off-by: “EricZequan” <[email protected]>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 14, 2024
Signed-off-by: “EricZequan” <[email protected]>
Signed-off-by: “EricZequan” <[email protected]>
@EricZequan
Contributor Author

/retest

Signed-off-by: “EricZequan” <[email protected]>
Contributor

@gengliqi gengliqi left a comment

It's better to add a const data version of vectorNonConstNeedle, vectorNonConstReplacement, and vectorNonConstNeedleReplacement.

Signed-off-by: “EricZequan” <[email protected]>
@EricZequan
Contributor Author

> It's better to add a const data version of vectorNonConstNeedle, vectorNonConstReplacement, and vectorNonConstNeedleReplacement.

To avoid redundant code, maybe we can skip adding them? 🤔

The processing logic for the constant case is the same as in the original functions; the constant just needs to be interpreted as a Column, so this change only needs a few dozen lines.

By the way, I tried the separate-function approach in the "debug" commit and the "add 2 function to complete replace for constant" commit, and the overall code was not elegant enough.

Comment on lines 248 to 249
const auto & const_data = col_const->getDataColumn();
const auto * col = typeid_cast<const ColumnString *>(&const_data);
Contributor

@gengliqi gengliqi Oct 15, 2024

It's wrong. The length of this column is just 1. You will find some errors if you correctly add a test for this case.
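
The row-count mismatch can be reproduced in isolation. In this sketch, ConstStrings is a hypothetical stand-in for a constant column: the nested data holds one value while the logical size is `rows`. Code that indexes the nested data by row number reads past its end for any row other than 0. The two safe options shown are mapping every row to the single stored value, or materializing a full column first.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a constant column (not the TiFlash class).
struct ConstStrings
{
    std::vector<std::string> nested; // always holds exactly one value
    std::size_t rows = 0;            // logical number of rows

    // Option 1: every valid row index maps to the single stored value.
    const std::string & at(std::size_t i) const
    {
        if (i >= rows)
            throw std::out_of_range("row index past logical size");
        return nested[0];
    }

    // Option 2: materialize one entry per logical row.
    std::vector<std::string> toFull() const
    {
        return std::vector<std::string>(rows, nested[0]);
    }
};
```

Option 2 costs one copy per row, so it trades performance for being able to reuse code that expects a full column.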

Contributor Author

Right. I have now added vectorConstSrcAndReplace, vectorConstSrcAndNeedle, and vectorConstSrc to handle the corresponding cases and ensure the row counts match.

Signed-off-by: “EricZequan” <[email protected]>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 15, 2024
Comment on lines 976 to 977
const ColumnString::Chars_t & data,
const ColumnString::Offsets & offsets,
Contributor

If the data is const, how about using const std::string &?

Contributor Author

Updated. PTAL~

@breezewish
Member

I'm still confused about why we need almost 300 lines to make something work for a subset. It is weird.

Signed-off-by: “EricZequan” <[email protected]>
@EricZequan
Contributor Author

> I'm still confused about why we need almost 300 lines to make something work for a subset. It is weird.

I understand what you mean. A constant can be seen as a column with only one row, so it is reasonable to reuse the original functions for this subset.
However, liqi suggested adding separate functions to handle it. Anyway, I will discuss it with him tomorrow~

Signed-off-by: “EricZequan” <[email protected]>
toVec({"Good Night", "Bad Afternoon", "Good Afterwhile"}),
executeFunction(
"replaceAll",
toVec({"Good Afternoon"}),
Member

@breezewish breezewish Oct 16, 2024

Should it be toConst? I suppose only const values can have a different length (1) from the other columns. And BTW, this is exactly the case that we want to run correctly.

toVec({"Good Night", "Good Bad", "Good while"}),
executeFunction(
"replaceAll",
toVec({"Good Afternoon"}),
Member

it should be toConst?

toVec({"Good Night", "Night Afternoon", "Good AfterNight"}),
executeFunction(
"replaceAll",
toVec({"Good Afternoon"}),
Member

it should be toConst?

Contributor

ti-chi-bot bot commented Oct 16, 2024

@EricZequan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-unit-test | 464d990 | link | true | /test pull-unit-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment on lines +987 to +988
auto data_col = ColumnString::create();
data_col->insert(data);
Member

Don't do this; it greatly reduces performance.

@gengliqi
Contributor

gengliqi commented Oct 16, 2024

> > I'm still confused about why we need almost 300 lines to make something work for a subset. It is weird.
>
> I understand what you mean. A constant can be seen as a column with only one row, so it is reasonable to reuse the original functions for this subset. However, liqi suggested adding separate functions to handle it. Anyway, I will discuss it with him tomorrow~

What I mean is not that separate functions must be used. You can use templates to reduce repetitive code while still maintaining good performance.
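
One way the template suggestion could look, as a sketch under assumptions rather than the code in either PR: each argument is wrapped in a small accessor (VecArg for a full column, ConstArg for a constant), and a single templated loop is instantiated once per combination. All names here are hypothetical, and plain std::string stands in for the column types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Accessor for a full column: row i reads element i.
struct VecArg
{
    const std::vector<std::string> & values;
    const std::string & operator[](std::size_t i) const { return values[i]; }
};

// Accessor for a constant: every row reads the same value.
struct ConstArg
{
    const std::string & value;
    const std::string & operator[](std::size_t) const { return value; }
};

// Replace every occurrence of `needle` in `src` (empty needle: unchanged).
inline std::string replaceOne(const std::string & src, const std::string & needle, const std::string & rep)
{
    if (needle.empty())
        return src;
    std::string out;
    std::size_t pos = 0;
    for (std::size_t hit; (hit = src.find(needle, pos)) != std::string::npos; pos = hit + needle.size())
    {
        out.append(src, pos, hit - pos);
        out += rep;
    }
    out.append(src, pos, std::string::npos);
    return out;
}

// One loop body for all const/vector combinations; the accessor types are
// resolved at compile time, so no per-row branching on "is it const?".
template <typename Src, typename Needle, typename Rep>
std::vector<std::string> replaceRows(std::size_t rows, const Src & src, const Needle & needle, const Rep & rep)
{
    std::vector<std::string> out;
    out.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i)
        out.push_back(replaceOne(src[i], needle[i], rep[i]));
    return out;
}
```

For example, replaceRows(3, ConstArg{src}, VecArg{needles}, VecArg{reps}) covers replace(Const, Column, Column) without a dedicated function, and swapping accessor types covers the other combinations.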

@breezewish
Member

I created a new PR here: #9536

@EricZequan EricZequan marked this pull request as draft October 17, 2024 03:57
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 17, 2024
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Nov 18, 2024
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

3 participants