Questions about query execution and optimization

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Questions about query execution and optimization

Hohl, Gerrit
Hello everyone,

this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
I'm using version 10.14.2.0 and the structure of the database looks like this:

CREATE TABLE license (
        id BIGINT NOT NULL,
        [...]
);
CREATE TABLE installation (
        id BIGINT NOT NULL,
        [...]
        license_id BIGINT NOT NULL,
        [...]
);
CREATE TABLE log (
        id BIGINT NOT NULL,
        action VARCHAR(255) NOT NULL,
        create_timestamp TIMESTAMP NOT NULL,
        entity_cls VARCHAR(255),
        entity_id BIGINT,
        type INTEGER NOT NULL,
        message VARCHAR(32672) NOT NULL,
        PRIMARY KEY (ID)
);
CREATE INDEX logcreatetimestampindex ON log (create_timestamp);
CREATE INDEX logentityclsentityidindex ON log (entity_cls, entity_id);
ALTER TABLE installation ADD CONSTRAINT fkinstallationlicense FOREIGN KEY (license_id) REFERENCES license (id);

The log table contains log messages about actions on records in other tables.
Because it should be useable with all tables, there can't be any foreign keys.
Especially as also logs should be kept when the corresponding record is deleted.

Now I wanted to create a query which returns all logs of a certain license and its installations:

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (
SELECT i.id FROM installation i WHERE (i.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

But that thing took forever (~ 12.5s).
I thought the reason would maybe my index, so I introduced two more:

CREATE INDEX logentityidindex ON log (entity_cls);
CREATE INDEX logentityclsindex ON log (entity_id);

Unfortunately that didn't change much. The query took almost the exact same amount of time.
Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123))
OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345)))
ORDER BY l.create_timestamp DESC, l.id DESC;

Much to my amazement that also didn't change anything.
In black despair I split the query into two while keeping the sub-select:

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
ORDER BY l.create_timestamp DESC, l.id DESC;

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN (
SELECT i.id FROM installation.id WHERE (l.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
In all cases I only have columns in my WHERE clause which are in the indexes.
But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.

Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.

Regards,
Gerrit

Reply | Threaded
Open this post in threaded view
|

AW: Questions about query execution and optimization

Hohl, Gerrit
Hello everyone,


me again.
In the meantime I tried to tune the table using the

        SYSCS_UTIL.SYSCS_COMPRESS_TABLE

and

        SYSCS_UTIL.SYSCS_UPDATE_STATISTICS

functions. It didn't have any effect on the query runtime.
I also tried the query analyse functions:


        CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
        CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
        <Query>
        VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
        CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(0);
        CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);

The result was long, localized (German in my case - nice)... and absolutely cryptic.
It didn't mention any of the tokens of my SQL query.
Instead it was taking about UNIONs and JOINs, also I don't use any. At least not explicitly.
I couldn't figure at all which entry of the analysis belonged to which part of the SQL query.

I remember when I worked with PostgreSQL there was that neat EXPLAIN command.
That command had a very good structure of its output.
And it seems they still have it in their current version:
https://www.postgresql.org/docs/current/using-explain.html


Regards,
Gerrit

-----Ursprüngliche Nachricht-----
Von: Hohl, Gerrit <[hidden email]>
Gesendet: Freitag, 3. Juli 2020 16:22
An: Derby Discussion <[hidden email]>
Betreff: Questions about query execution and optimization

Hello everyone,

this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
I'm using version 10.14.2.0 and the structure of the database looks like this:

CREATE TABLE license (
        id BIGINT NOT NULL,
        [...]
);
CREATE TABLE installation (
        id BIGINT NOT NULL,
        [...]
        license_id BIGINT NOT NULL,
        [...]
);
CREATE TABLE log (
        id BIGINT NOT NULL,
        action VARCHAR(255) NOT NULL,
        create_timestamp TIMESTAMP NOT NULL,
        entity_cls VARCHAR(255),
        entity_id BIGINT,
        type INTEGER NOT NULL,
        message VARCHAR(32672) NOT NULL,
        PRIMARY KEY (ID)
);
CREATE INDEX logcreatetimestampindex ON log (create_timestamp);
CREATE INDEX logentityclsentityidindex ON log (entity_cls, entity_id);
ALTER TABLE installation ADD CONSTRAINT fkinstallationlicense FOREIGN KEY (license_id) REFERENCES license (id);

The log table contains log messages about actions on records in other tables.
Because it should be useable with all tables, there can't be any foreign keys.
Especially as also logs should be kept when the corresponding record is deleted.

Now I wanted to create a query which returns all logs of a certain license and its installations:

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (
SELECT i.id FROM installation i WHERE (i.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

But that thing took forever (~ 12.5s).
I thought the reason would maybe my index, so I introduced two more:

CREATE INDEX logentityidindex ON log (entity_cls);
CREATE INDEX logentityclsindex ON log (entity_id);

Unfortunately that didn't change much. The query took almost the exact same amount of time.
Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123))
OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345)))
ORDER BY l.create_timestamp DESC, l.id DESC;

Much to my amazement that also didn't change anything.
In black despair I split the query into two while keeping the sub-select:

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
ORDER BY l.create_timestamp DESC, l.id DESC;

SELECT l.*
FROM log l
WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN (
SELECT i.id FROM installation.id WHERE (l.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
In all cases I only have columns in my WHERE clause which are in the indexes.
But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.

Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.

Regards,
Gerrit

Reply | Threaded
Open this post in threaded view
|

Re: AW: Questions about query execution and optimization

Rick Hillegas-3
Hi Gerrit,

I suspect that your query performs poorly because your indexes do not
cover the query. That means that you are selecting columns which don't
appear in the indexes. In this case, the optimizer knows that Derby
cannot satisfy the query by simply reading index pages. Derby also has
to read base table pages in order to fetch the other columns. Please see
http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz30768.html

Derby query plans are not easy to read. If you are snapshotting plans
with XPLAIN style, then the details go into snapshot tables. In theory,
you could write a tool to read those snapshot results and format them
better.

Hope this helps,
-Rick

On 7/6/20 12:46 AM, Hohl, Gerrit wrote:

> Hello everyone,
>
>
> me again.
> In the meantime I tried to tune the table using the
>
> SYSCS_UTIL.SYSCS_COMPRESS_TABLE
>
> and
>
> SYSCS_UTIL.SYSCS_UPDATE_STATISTICS
>
> functions. It didn't have any effect on the query runtime.
> I also tried the query analyse functions:
>
>
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
> <Query>
> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(0);
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);
>
> The result was long, localized (German in my case - nice)... and absolutely cryptic.
> It didn't mention any of the tokens of my SQL query.
> Instead it was taking about UNIONs and JOINs, also I don't use any. At least not explicitly.
> I couldn't figure at all which entry of the analysis belonged to which part of the SQL query.
>
> I remember when I worked with PostgreSQL there was that neat EXPLAIN command.
> That command had a very good structure of its output.
> And it seems they still have it in their current version:
> https://www.postgresql.org/docs/current/using-explain.html
>
>
> Regards,
> Gerrit
>
> -----Ursprüngliche Nachricht-----
> Von: Hohl, Gerrit <[hidden email]>
> Gesendet: Freitag, 3. Juli 2020 16:22
> An: Derby Discussion <[hidden email]>
> Betreff: Questions about query execution and optimization
>
> Hello everyone,
>
> this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
> I'm using version 10.14.2.0 and the structure of the database looks like this:
>
> CREATE TABLE license (
> id BIGINT NOT NULL,
> [...]
> );
> CREATE TABLE installation (
> id BIGINT NOT NULL,
> [...]
> license_id BIGINT NOT NULL,
> [...]
> );
> CREATE TABLE log (
> id BIGINT NOT NULL,
> action VARCHAR(255) NOT NULL,
> create_timestamp TIMESTAMP NOT NULL,
> entity_cls VARCHAR(255),
> entity_id BIGINT,
> type INTEGER NOT NULL,
> message VARCHAR(32672) NOT NULL,
> PRIMARY KEY (ID)
> );
> CREATE INDEX logcreatetimestampindex ON log (create_timestamp);
> CREATE INDEX logentityclsentityidindex ON log (entity_cls, entity_id);
> ALTER TABLE installation ADD CONSTRAINT fkinstallationlicense FOREIGN KEY (license_id) REFERENCES license (id);
>
> The log table contains log messages about actions on records in other tables.
> Because it should be useable with all tables, there can't be any foreign keys.
> Especially as also logs should be kept when the corresponding record is deleted.
>
> Now I wanted to create a query which returns all logs of a certain license and its installations:
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
> OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (
> SELECT i.id FROM installation i WHERE (i.license_id = ?)
> )))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> But that thing took forever (~ 12.5s).
> I thought the reason would maybe my index, so I introduced two more:
>
> CREATE INDEX logentityidindex ON log (entity_cls);
> CREATE INDEX logentityclsindex ON log (entity_id);
>
> Unfortunately that didn't change much. The query took almost the exact same amount of time.
> Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123))
> OR ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345)))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> Much to my amazement that also didn't change anything.
> In black despair I split the query into two while keeping the sub-select:
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN (
> SELECT i.id FROM installation.id WHERE (l.license_id = ?)
> )))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
> As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
> In all cases I only have columns in my WHERE clause which are in the indexes.
> But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.
>
> Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
> Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.
>
> Regards,
> Gerrit
>

Reply | Threaded
Open this post in threaded view
|

AW: AW: Questions about query execution and optimization

Hohl, Gerrit
Hi Rick,


thanks for your message.


Thanks also for the link, also I'm not sure I understand its content correctly.
My understand of indexes so far was that they are used to filter tables when performing a query.
Let's say we have a table PERSON containing the name and age of persons. We have an index on the age column.
In our query we want all persons which are 30 years old: SELECT * FROM person WHERE age = 30;
I would expect that the DBMS - after it analysed the query - will look for all "30" entries in the age index.
Those contain pointers on the original records / tuples in the table file.
Let's say we have 10 matches in the age index, the DBMS will get those 10 matches from the table file.
Of course, if all columns we select are already in the index - in the best case we have "SELECT age FROM person WHERE age = 30;" - the query would be extremely fast.
And it would be slower if additional columns which are not backed by an index can also be found in the WHERE clause.

But even set that aside: In my 1st mail I selected the same columns in the big query as well as in the two smaller queries.
There was no difference between them in this aspect.
The difference was in the WHERE clause - which was more complicated in the big query and a simpler in the two smaller queries.
But in all versions the same columns were used in the WHERE clause, all of them backed by indexes.
So it won't also explain the behaviour.


About the XPLAIN style things: I guess you're talking about this here.
https://db.apache.org/derby/docs/10.14/tuning/ctun_xplain_style.html
Seems like a neat little project on its own...


Regards,
Gerrit

-----Ursprüngliche Nachricht-----
Von: Rick Hillegas <[hidden email]>
Gesendet: Montag, 6. Juli 2020 14:54
An: Derby Discussion <[hidden email]>; Hohl, Gerrit <[hidden email]>
Betreff: Re: AW: Questions about query execution and optimization

Hi Gerrit,

I suspect that your query performs poorly because your indexes do not cover the query. That means that you are selecting columns which don't appear in the indexes. In this case, the optimizer knows that Derby cannot satisfy the query by simply reading index pages. Derby also has to read base table pages in order to fetch the other columns. Please see http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz30768.html

Derby query plans are not easy to read. If you are snapshotting plans with XPLAIN style, then the details go into snapshot tables. In theory, you could write a tool to read those snapshot results and format them better.

Hope this helps,
-Rick

On 7/6/20 12:46 AM, Hohl, Gerrit wrote:

> Hello everyone,
>
>
> me again.
> In the meantime I tried to tune the table using the
>
> SYSCS_UTIL.SYSCS_COMPRESS_TABLE
>
> and
>
> SYSCS_UTIL.SYSCS_UPDATE_STATISTICS
>
> functions. It didn't have any effect on the query runtime.
> I also tried the query analyse functions:
>
>
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
> <Query>
> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(0);
> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);
>
> The result was long, localized (German in my case - nice)... and absolutely cryptic.
> It didn't mention any of the tokens of my SQL query.
> Instead it was taking about UNIONs and JOINs, also I don't use any. At least not explicitly.
> I couldn't figure at all which entry of the analysis belonged to which part of the SQL query.
>
> I remember when I worked with PostgreSQL there was that neat EXPLAIN command.
> That command had a very good structure of its output.
> And it seems they still have it in their current version:
> https://www.postgresql.org/docs/current/using-explain.html
>
>
> Regards,
> Gerrit
>
> -----Ursprüngliche Nachricht-----
> Von: Hohl, Gerrit <[hidden email]>
> Gesendet: Freitag, 3. Juli 2020 16:22
> An: Derby Discussion <[hidden email]>
> Betreff: Questions about query execution and optimization
>
> Hello everyone,
>
> this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
> I'm using version 10.14.2.0 and the structure of the database looks like this:
>
> CREATE TABLE license (
> id BIGINT NOT NULL,
> [...]
> );
> CREATE TABLE installation (
> id BIGINT NOT NULL,
> [...]
> license_id BIGINT NOT NULL,
> [...]
> );
> CREATE TABLE log (
> id BIGINT NOT NULL,
> action VARCHAR(255) NOT NULL,
> create_timestamp TIMESTAMP NOT NULL,
> entity_cls VARCHAR(255),
> entity_id BIGINT,
> type INTEGER NOT NULL,
> message VARCHAR(32672) NOT NULL,
> PRIMARY KEY (ID)
> );
> CREATE INDEX logcreatetimestampindex ON log (create_timestamp); CREATE
> INDEX logentityclsentityidindex ON log (entity_cls, entity_id); ALTER
> TABLE installation ADD CONSTRAINT fkinstallationlicense FOREIGN KEY
> (license_id) REFERENCES license (id);
>
> The log table contains log messages about actions on records in other tables.
> Because it should be useable with all tables, there can't be any foreign keys.
> Especially as also logs should be kept when the corresponding record is deleted.
>
> Now I wanted to create a query which returns all logs of a certain license and its installations:
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) OR
> ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT i.id FROM
> installation i WHERE (i.license_id = ?)
> )))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> But that thing took forever (~ 12.5s).
> I thought the reason would maybe my index, so I introduced two more:
>
> CREATE INDEX logentityidindex ON log (entity_cls); CREATE INDEX
> logentityclsindex ON log (entity_id);
>
> Unfortunately that didn't change much. The query took almost the exact same amount of time.
> Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123)) OR
> ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345))) ORDER
> BY l.create_timestamp DESC, l.id DESC;
>
> Much to my amazement that also didn't change anything.
> In black despair I split the query into two while keeping the sub-select:
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) ORDER BY
> l.create_timestamp DESC, l.id DESC;
>
> SELECT l.*
> FROM log l
> WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT
> i.id FROM installation.id WHERE (l.license_id = ?)
> )))
> ORDER BY l.create_timestamp DESC, l.id DESC;
>
> Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
> As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
> In all cases I only have columns in my WHERE clause which are in the indexes.
> But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.
>
> Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
> Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.
>
> Regards,
> Gerrit
>

Reply | Threaded
Open this post in threaded view
|

Re: AW: AW: Questions about query execution and optimization

Rick Hillegas-3
Hi Gerrit,

One thing I notice about your big WHERE clause is that it contains an
OR. In general, ORs are not optimizable. You will end of with full table
scans. Please see
http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz39106.html

Hope this helps,
-Rick

On 7/6/20 6:21 AM, Hohl, Gerrit wrote:

> Hi Rick,
>
>
> thanks for your message.
>
>
> Thanks also for the link, also I'm not sure I understand its content correctly.
> My understand of indexes so far was that they are used to filter tables when performing a query.
> Let's say we have a table PERSON containing the name and age of persons. We have an index on the age column.
> In our query we want all persons which are 30 years old: SELECT * FROM person WHERE age = 30;
> I would expect that the DBMS - after it analysed the query - will look for all "30" entries in the age index.
> Those contain pointers on the original records / tuples in the table file.
> Let's say we have 10 matches in the age index, the DBMS will get those 10 matches from the table file.
> Of course, if all columns we select are already in the index - in the best case we have "SELECT age FROM person WHERE age = 30;" - the query would be extremely fast.
> And it would be slower if additional columns which are not backed by an index can also be found in the WHERE clause.
>
> But even set that aside: In my 1st mail I selected the same columns in the big query as well as in the two smaller queries.
> There was no difference between them in this aspect.
> The difference was in the WHERE clause - which was more complicated in the big query and a simpler in the two smaller queries.
> But in all versions the same columns were used in the WHERE clause, all of them backed by indexes.
> So it won't also explain the behaviour.
>
>
> About the XPLAIN style things: I guess you're talking about this here.
> https://db.apache.org/derby/docs/10.14/tuning/ctun_xplain_style.html
> Seems like a neat little project on its own...
>
>
> Regards,
> Gerrit
>
> -----Ursprüngliche Nachricht-----
> Von: Rick Hillegas <[hidden email]>
> Gesendet: Montag, 6. Juli 2020 14:54
> An: Derby Discussion <[hidden email]>; Hohl, Gerrit <[hidden email]>
> Betreff: Re: AW: Questions about query execution and optimization
>
> Hi Gerrit,
>
> I suspect that your query performs poorly because your indexes do not cover the query. That means that you are selecting columns which don't appear in the indexes. In this case, the optimizer knows that Derby cannot satisfy the query by simply reading index pages. Derby also has to read base table pages in order to fetch the other columns. Please see http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz30768.html
>
> Derby query plans are not easy to read. If you are snapshotting plans with XPLAIN style, then the details go into snapshot tables. In theory, you could write a tool to read those snapshot results and format them better.
>
> Hope this helps,
> -Rick
>
> On 7/6/20 12:46 AM, Hohl, Gerrit wrote:
>> Hello everyone,
>>
>>
>> me again.
>> In the meantime I tried to tune the table using the
>>
>> SYSCS_UTIL.SYSCS_COMPRESS_TABLE
>>
>> and
>>
>> SYSCS_UTIL.SYSCS_UPDATE_STATISTICS
>>
>> functions. It didn't have any effect on the query runtime.
>> I also tried the query analyse functions:
>>
>>
>> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
>> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
>> <Query>
>> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
>> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(0);
>> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);
>>
>> The result was long, localized (German in my case - nice)... and absolutely cryptic.
>> It didn't mention any of the tokens of my SQL query.
>> Instead it was taking about UNIONs and JOINs, also I don't use any. At least not explicitly.
>> I couldn't figure at all which entry of the analysis belonged to which part of the SQL query.
>>
>> I remember when I worked with PostgreSQL there was that neat EXPLAIN command.
>> That command had a very good structure of its output.
>> And it seems they still have it in their current version:
>> https://www.postgresql.org/docs/current/using-explain.html
>>
>>
>> Regards,
>> Gerrit
>>
>> -----Ursprüngliche Nachricht-----
>> Von: Hohl, Gerrit <[hidden email]>
>> Gesendet: Freitag, 3. Juli 2020 16:22
>> An: Derby Discussion <[hidden email]>
>> Betreff: Questions about query execution and optimization
>>
>> Hello everyone,
>>
>> this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
>> I'm using version 10.14.2.0 and the structure of the database looks like this:
>>
>> CREATE TABLE license (
>> id BIGINT NOT NULL,
>> [...]
>> );
>> CREATE TABLE installation (
>> id BIGINT NOT NULL,
>> [...]
>> license_id BIGINT NOT NULL,
>> [...]
>> );
>> CREATE TABLE log (
>> id BIGINT NOT NULL,
>> action VARCHAR(255) NOT NULL,
>> create_timestamp TIMESTAMP NOT NULL,
>> entity_cls VARCHAR(255),
>> entity_id BIGINT,
>> type INTEGER NOT NULL,
>> message VARCHAR(32672) NOT NULL,
>> PRIMARY KEY (ID)
>> );
>> CREATE INDEX logcreatetimestampindex ON log (create_timestamp); CREATE
>> INDEX logentityclsentityidindex ON log (entity_cls, entity_id); ALTER
>> TABLE installation ADD CONSTRAINT fkinstallationlicense FOREIGN KEY
>> (license_id) REFERENCES license (id);
>>
>> The log table contains log messages about actions on records in other tables.
>> Because it should be useable with all tables, there can't be any foreign keys.
>> Especially as also logs should be kept when the corresponding record is deleted.
>>
>> Now I wanted to create a query which returns all logs of a certain license and its installations:
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) OR
>> ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT i.id FROM
>> installation i WHERE (i.license_id = ?)
>> )))
>> ORDER BY l.create_timestamp DESC, l.id DESC;
>>
>> But that thing took forever (~ 12.5s).
>> I thought the reason would maybe my index, so I introduced two more:
>>
>> CREATE INDEX logentityidindex ON log (entity_cls); CREATE INDEX
>> logentityclsindex ON log (entity_id);
>>
>> Unfortunately that didn't change much. The query took almost the exact same amount of time.
>> Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123)) OR
>> ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345))) ORDER
>> BY l.create_timestamp DESC, l.id DESC;
>>
>> Much to my amazement that also didn't change anything.
>> In black despair I split the query into two while keeping the sub-select:
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) ORDER BY
>> l.create_timestamp DESC, l.id DESC;
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT
>> i.id FROM installation.id WHERE (l.license_id = ?)
>> )))
>> ORDER BY l.create_timestamp DESC, l.id DESC;
>>
>> Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
>> As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
>> In all cases I only have columns in my WHERE clause which are in the indexes.
>> But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.
>>
>> Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
>> Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.
>>
>> Regards,
>> Gerrit
>>

Reply | Threaded
Open this post in threaded view
|

AW: AW: AW: Questions about query execution and optimization

Hohl, Gerrit
Hi Rick,


ah, okay, that was the missing piece of puzzle.
And as I mentioned PostgreSQL before: It seems that ORs are a general problem.
https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/

But it seems there are some techniques or patterns to avoid them. One of them is using an UNION:

SELECT l.id, l.action, l.create_timestamp, l.entity_cls, l.entity_id, l.type, l.message
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
ORDER BY l.create_timestamp DESC, l.id DESC;

--> Total: 0.187s

SELECT l.id, l.action, l.create_timestamp, l.entity_cls, l.entity_id, l.type, l.message
FROM log l
WHERE ((l.entity_cls = 'Installation') AND (l.entity_id IN (
SELECT i.id FROM installation i WHERE (i.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

--> Total: 0.407s

SELECT l.id, l.action, l.create_timestamp, l.entity_cls, l.entity_id, l.type, l.message
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
OR ((l.entity_cls = 'Installation') AND (l.entity_id IN (
SELECT i.id FROM installation i WHERE (i.license_id = ?)
)))
ORDER BY l.create_timestamp DESC, l.id DESC;

--> Total: 23.190s

SELECT l.id, l.action, l.create_timestamp, l.entity_cls, l.entity_id, l.type, l.message
FROM log l
WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?))
UNION
SELECT l.id, l.action, l.create_timestamp, l.entity_cls, l.entity_id, l.type, l.message
FROM log l
WHERE ((l.entity_cls = 'Installation') AND (l.entity_id IN (
SELECT i.id FROM installation i WHERE (i.license_id = ?)
)))
ORDER BY create_timestamp DESC, id DESC;

--> Total: 0.624s

Not bad, I would say. Maybe one remark: I used also aliases in the ORDER BY clause.
But you can't as you work on the result of the UNION which only consists of the columns without their aliases.

Thanks a lot, Rick. That helped me a lot.


Regards
Gerrit

-----Ursprüngliche Nachricht-----
Von: Rick Hillegas <[hidden email]>
Gesendet: Dienstag, 7. Juli 2020 00:48
An: Derby Discussion <[hidden email]>; Hohl, Gerrit <[hidden email]>
Betreff: Re: AW: AW: Questions about query execution and optimization

Hi Gerrit,

One thing I notice about your big WHERE clause is that it contains an OR. In general, ORs are not optimizable. You will end of with full table scans. Please see http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz39106.html

Hope this helps,
-Rick

On 7/6/20 6:21 AM, Hohl, Gerrit wrote:

> Hi Rick,
>
>
> thanks for your message.
>
>
> Thanks also for the link, also I'm not sure I understand its content correctly.
> My understand of indexes so far was that they are used to filter tables when performing a query.
> Let's say we have a table PERSON containing the name and age of persons. We have an index on the age column.
> In our query we want all persons which are 30 years old: SELECT * FROM
> person WHERE age = 30; I would expect that the DBMS - after it analysed the query - will look for all "30" entries in the age index.
> Those contain pointers on the original records / tuples in the table file.
> Let's say we have 10 matches in the age index, the DBMS will get those 10 matches from the table file.
> Of course, if all columns we select are already in the index - in the best case we have "SELECT age FROM person WHERE age = 30;" - the query would be extremely fast.
> And it would be slower if additional columns which are not backed by an index can also be found in the WHERE clause.
>
> But even set that aside: In my 1st mail I selected the same columns in the big query as well as in the two smaller queries.
> There was no difference between them in this aspect.
> The difference was in the WHERE clause - which was more complicated in the big query and a simpler in the two smaller queries.
> But in all versions the same columns were used in the WHERE clause, all of them backed by indexes.
> So it won't also explain the behaviour.
>
>
> About the XPLAIN style things: I guess you're talking about this here.
> https://db.apache.org/derby/docs/10.14/tuning/ctun_xplain_style.html
> Seems like a neat little project on its own...
>
>
> Regards,
> Gerrit
>
> -----Ursprüngliche Nachricht-----
> Von: Rick Hillegas <[hidden email]>
> Gesendet: Montag, 6. Juli 2020 14:54
> An: Derby Discussion <[hidden email]>; Hohl, Gerrit
> <[hidden email]>
> Betreff: Re: AW: Questions about query execution and optimization
>
> Hi Gerrit,
>
> I suspect that your query performs poorly because your indexes do not
> cover the query. That means that you are selecting columns which don't
> appear in the indexes. In this case, the optimizer knows that Derby
> cannot satisfy the query by simply reading index pages. Derby also has
> to read base table pages in order to fetch the other columns. Please
> see http://db.apache.org/derby/docs/10.15/tuning/ctunoptimz30768.html
>
> Derby query plans are not easy to read. If you are snapshotting plans with XPLAIN style, then the details go into snapshot tables. In theory, you could write a tool to read those snapshot results and format them better.
>
> Hope this helps,
> -Rick
>
> On 7/6/20 12:46 AM, Hohl, Gerrit wrote:
>> Hello everyone,
>>
>>
>> me again.
>> In the meantime I tried to tune the table using the
>>
>> SYSCS_UTIL.SYSCS_COMPRESS_TABLE
>>
>> and
>>
>> SYSCS_UTIL.SYSCS_UPDATE_STATISTICS
>>
>> functions. It didn't have any effect on the query runtime.
>> I also tried the query analyse functions:
>>
>>
>> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
>> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(1);
>> <Query>
>> VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
>> CALL SYSCS_UTIL.SYSCS_SET_STATISTICS_TIMING(0);
>> CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(0);
>>
>> The result was long, localized (German in my case - nice)... and absolutely cryptic.
>> It didn't mention any of the tokens of my SQL query.
>> Instead it was taking about UNIONs and JOINs, also I don't use any. At least not explicitly.
>> I couldn't figure at all which entry of the analysis belonged to which part of the SQL query.
>>
>> I remember when I worked with PostgreSQL there was that neat EXPLAIN command.
>> That command had a very good structure of its output.
>> And it seems they still have it in their current version:
>> https://www.postgresql.org/docs/current/using-explain.html
>>
>>
>> Regards,
>> Gerrit
>>
>> -----Ursprüngliche Nachricht-----
>> Von: Hohl, Gerrit <[hidden email]>
>> Gesendet: Freitag, 3. Juli 2020 16:22
>> An: Derby Discussion <[hidden email]>
>> Betreff: Questions about query execution and optimization
>>
>> Hello everyone,
>>
>> this week I came across a behaviour of Apache Derby which I couldn't explain to myself.
>> I'm using version 10.14.2.0 and the structure of the database looks like this:
>>
>> CREATE TABLE license (
>> id BIGINT NOT NULL,
>> [...]
>> );
>> CREATE TABLE installation (
>> id BIGINT NOT NULL,
>> [...]
>> license_id BIGINT NOT NULL,
>> [...]
>> );
>> CREATE TABLE log (
>> id BIGINT NOT NULL,
>> action VARCHAR(255) NOT NULL,
>> create_timestamp TIMESTAMP NOT NULL,
>> entity_cls VARCHAR(255),
>> entity_id BIGINT,
>> type INTEGER NOT NULL,
>> message VARCHAR(32672) NOT NULL,
>> PRIMARY KEY (ID)
>> );
>> CREATE INDEX logcreatetimestampindex ON log (create_timestamp);
>> CREATE INDEX logentityclsentityidindex ON log (entity_cls,
>> entity_id); ALTER TABLE installation ADD CONSTRAINT
>> fkinstallationlicense FOREIGN KEY
>> (license_id) REFERENCES license (id);
>>
>> The log table contains log messages about actions on records in other tables.
>> Because it should be useable with all tables, there can't be any foreign keys.
>> Especially as also logs should be kept when the corresponding record is deleted.
>>
>> Now I wanted to create a query which returns all logs of a certain license and its installations:
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) OR
>> ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT i.id
>> FROM installation i WHERE (i.license_id = ?)
>> )))
>> ORDER BY l.create_timestamp DESC, l.id DESC;
>>
>> But that thing took forever (~ 12.5s).
>> I thought the reason would maybe my index, so I introduced two more:
>>
>> CREATE INDEX logentityidindex ON log (entity_cls); CREATE INDEX
>> logentityclsindex ON log (entity_id);
>>
>> Unfortunately that didn't change much. The query took almost the exact same amount of time.
>> Next thing was removing the sub-select and directly giving a list of IDs for the installation records - just for testing.
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = 123)) OR
>> ((l.entity_cls = 'Installation) AND (l.entity_id IN (234, 345)))
>> ORDER BY l.create_timestamp DESC, l.id DESC;
>>
>> Much to my amazement that also didn't change anything.
>> In black despair I split the query into two while keeping the sub-select:
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'License') AND (l.entity_id = ?)) ORDER BY
>> l.create_timestamp DESC, l.id DESC;
>>
>> SELECT l.*
>> FROM log l
>> WHERE ((l.entity_cls = 'Installation) AND (l.entity_id IN ( SELECT
>> i.id FROM installation.id WHERE (l.license_id = ?)
>> )))
>> ORDER BY l.create_timestamp DESC, l.id DESC;
>>
>> Unbelievable: The 1st query took ~0.15s while the 2nd query took ~0.2s.
>> As the sub-select seems not to have any effect how fast or slow the query is, that can be ignored.
>> In all cases I only have columns in my WHERE clause which are in the indexes.
>> But somehow Apache Derby seems not to notice it and not using them, if the WHERE clause gets too complicated.
>>
>> Is there anything I haven't seen? Anything I'm doing wrong by structuring the query like I did?
>> Somehow I don't get it why the execution time gets 100 times slower just by having everything in one query.
>>
>> Regards,
>> Gerrit
>>