With >100k records, things are breaking šŸ˜•

Working with a DB of ~160k records, I consistently run into problems like:

  • Views will not load; "Canceling statement due to statement timeout"
  • Cannot perform even fairly simple graphQL queries with a limit of 20-100 records; "The upstream server is timing out"
  • Loading a DB setup page or a Rule page can take minutes, causing the browser to repeatedly prompt that the "page is unresponsive" before eventually loading… maybe.
  • The UI for all Fibery pages grinds to a halt and uses up all memory

I am faced with the simple task of deleting a few thousand records, but can find no way to easily accomplish this, because everything I try simply times out or gets canceled due to the above issues.

Additionally, when these errors occur, there is no feedback about whether the aborted operation was completely or partially executed, or rolled back, or …?

E.g. the graphQL API call below might succeed if the limit is <80, but otherwise will consistently return "upstream server is timing out":

mutation {
  callStats(
    period: { name: { is: "Day" } }
    limit: 100
  )
  { delete { message } }
}

Deleting a few thousand records defined by such a simple filter should be easy. But it is impossible without resorting to complex API-call loops.

The obvious thing would be to use a Rule to delete old records, but this does not work because:

  • Rules also suffer from the same timeout issues.
  • There is no way to use LIMIT with Rules (unless in a script).
  • Rules run at most once per hour - not enough if each API call is limited to deleting 50 records and there are tens of thousands to delete.

I have resorted to looping a local script to repeatedly send a graphQL mutation that deletes 50-70 records at a time - each call can take 20-60 seconds to finish, or it might just return an error, with no indication of what (if anything) it actually accomplished.
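
For reference, the loop is roughly the following (a minimal sketch only - the workspace URL, space name, token variable, batch size and stop conditions are placeholders/assumptions, so adjust them for your own setup):

# Minimal sketch of the local batch-delete loop described above.
# Assumptions: the endpoint URL format, the FIBERY_TOKEN env var and the
# batch size are placeholders - adjust them for your workspace.
import os
import time
import requests

URL = "https://YOUR_ACCOUNT.fibery.io/api/graphql/space/YOUR_SPACE"  # placeholder
HEADERS = {"Authorization": f"Token {os.environ['FIBERY_TOKEN']}"}

MUTATION = """
mutation {
  callStats(
    period: { name: { is: "Day" } }
    limit: 50
  )
  { delete { message } }
}
"""

for batch in range(200):  # hard upper bound on the number of batches
    resp = requests.post(URL, json={"query": MUTATION}, headers=HEADERS, timeout=120)
    if resp.status_code != 200:                 # e.g. "upstream server is timing out"
        print("HTTP error:", resp.status_code, resp.text[:200])
        break
    data = resp.json()
    print(batch, data)
    if "errors" in data:                        # GraphQL-level error (e.g. statement timeout)
        break
    time.sleep(5)                               # give the server a moment between batches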

Is this really the best we can do in a modern "no code" platform?
It should not take days to delete a few thousand records.



P.S. I did ask for help with this issue via Chat, but that was 3 days ago and I have received no help.
So, posting here.

3 Likes

I can't see an unanswered chat from you, but I can see a chat where we are still digging into the problem. I expect @Kseniya_Piotuh will get back to you when there is news.
It certainly sounds like there is something funny going on, beyond what should be expected for a large database.

I'm also keen to understand the architectural decisions behind Fibery. When a system like Fibery already struggles to render pages or perform tasks with 100k records, that is a serious scalability or performance-optimization issue.
Specifically, how have you designed and optimized the system to efficiently handle large datasets and address potential performance bottlenecks? For example:

  1. Challenges:
  • Handling performance slowdowns associated with data growth.
  • Addressing physical storage limitations.
  • Managing resource-intensive complex queries.
  2. Data Management Strategies:
  • Techniques to optimize database queries.
  • Implementation of connection pools.
  • Use of database indexing and caching.
  • Approaches to avoid complex joins and reduce the number of database queries.
  • Methods for data compression to reduce storage space and improve retrieval times.
  3. User Experience and Data Retrieval:
  • Use of pagination to manage data display.
  • Implementation of lazy loading for on-demand data retrieval.
  4. Scaling Approaches:
  • Vertical scaling strategies to enhance a single server's capabilities.
  • Horizontal scaling methods, including the use of replication and sharding.
  5. Scaling Complexities:
  • How you handle query routing in sharded systems.
  • Ensuring data consistency in replicated databases.
  • Managing the increased administrative tasks associated with scaling.
  6. Infrastructure and Hardware:
  • Strategies for hardware and infrastructure scaling, including cloud solutions.
  • Techniques for archiving old, infrequently accessed data.
  7. Monitoring:
  • Techniques to pinpoint performance bottlenecks.
  • Strategies to ensure optimal resource usage.

Thank you in advance for shedding light on these aspects.

I don't think you can expect to get an in-depth answer to this.
Fibery has a team of experienced developers, and the system has been designed with various performance criteria in mind, including with large datasets.
We are considering making some constraints/limitations explicit, so that users can set their expectations at the correct level.
We are actively reviewing the case that @Matt_Blais has raised.

4 Likes

Hello, @Matt_Blais

Thanks for the feedback. Regarding the GraphQL upstream timeout errors, I would like to let you know that we are working on functionality that allows starting and monitoring background jobs for such tasks. It looks like this is the only way in the near future to execute long-running processes without failures. We will keep you informed about that. Sorry for the inconvenience.

Thanks,
Oleg

2 Likes

Hi, @Matt_Blais

We have implemented a way to execute GraphQL mutations as background jobs to avoid timeouts.
You can find documentation here: Fibery GraphQL API
Please let me know how it goes if you have a chance to try it.

I know it doesn't solve all of your issues, but we are not giving up and will continue tuning automation performance to improve things there.

Thanks,
Oleg

3 Likes

Thanks @Oleg - that is a great step forward. :grinning:

This is what I need to do to clean up and delete older entities:

  • I have ~586k "Call" records which are linked to ~35k "Call Stats" records.
  • I will use a long-running graphQL call to unlink 450k Call records from Stats.
  • Every unlink triggers a Stats Rule which adds the unlinked Call's values to the Stats.
  • When I do this manually, I can see the Stats values increasing as the Rule processes through all the unlinked Calls (about 10-20 per second).

I have no problem waiting a long time for everything to finish, but this will create a very deep queue of unlinked Calls for the Rule to process, and if the Rules do not all eventually finish successfully, that would be a problem (data inconsistency).

Should I wait to start this process? Or do you believe the system can currently handle this scenario correctly?

NVM, I think I figured it out – see below.

@Oleg I'm having trouble understanding the syntax for the new graphQL capability - it's not clear how to specify the filter/search criteria for selecting the records to be mutated.

Could you please translate this "normal" graphQL query to work in the extended mode?
Note that the filtered Calls are first updated, then unlinked:

mutation {
  calls(
    callStartTime: { less: "2023-10-11" }
    callStats:     { isEmpty: false }
    status:        { isNull: true }
    limit: 1 offset: 0
  )
  { update( status: "Freeze" ) { message entities{id} }
    unlinkCallStats            { message }
  }
}

Solution:

mutation {
  calls (
    callStartTime: { less: "2023-10-11" }
    callStats:     { isEmpty: false }
    status:        { isNull: true }
    limit: 1 offset: 0
  )
  { executeAsBackgroundJob {
      jobId actions {
        update( status: "Freeze" ) {message entities{id}}
        unlinkCallStats            {message}
      }
    }
  }
}

Comment on executeAsBackgroundJob:

It appears that the results ultimately returned from a completed background job can only return the id of affected entities, but no other fields.

Because I cannot see the Name, Public Id, etc. of the affected entities, it is difficult to verify which entities were actually mutated. This makes testing/verifying a batch query difficult.

Could you add a query to find the entities (based on the IDs returned)?

Can a mutation also query on the returned Ids?

Or do you mean just do a separate query after the mutation is done, to get related entity info for the Ids returned by the mutation?

I meant the latter
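
i.e. something along these lines (a rough sketch only - the query name findCalls, the id filter operator and the field names are assumptions, so check them against the GraphQL schema generated for your space):

# Rough sketch: after the background job completes, fetch extra fields for the
# ids it returned. The query name (findCalls), the "in" filter operator and the
# field names are assumptions - verify them in your space's GraphQL schema.
import os
import requests

URL = "https://YOUR_ACCOUNT.fibery.io/api/graphql/space/YOUR_SPACE"  # placeholder
HEADERS = {"Authorization": f"Token {os.environ['FIBERY_TOKEN']}"}

def fetch_mutated_calls(ids):
    id_list = ", ".join(f'"{i}"' for i in ids)
    query = """
query {
  findCalls(id: { in: [%s] }) {
    id
    publicId
    name
  }
}
""" % id_list
    resp = requests.post(URL, json={"query": query}, headers=HEADERS, timeout=60)
    return resp.json()

# ids as returned by the background-job mutation
print(fetch_mutated_calls(["00000000-0000-0000-0000-000000000000"]))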

1 Like

If (1) the entities were only mutated and not deleted, and (2) I wasn't so lazy,

then Yes, I could do that :stuck_out_tongue_winking_eye:

Well I suppose I wonder why you think you need to

[quote="Matt_Blais, post:9, topic:5237"]
verify which entities were actually mutated
[/quote]?

Are you concerned that the API call is not doing what you ask it to do? Or are you concerned that you've written the filter criteria incorrectly?

And just for info, the full IDs of entities are available on the GUI if you did want to check a sample of them.

Yes, both.

Hello, @Matt_Blais

It appears that the results ultimately returned from a completed background job can only return the id of affected entities, but no other fields.

It works the same way for regular mutations. Unfortunately, due to high memory consumption we can return only ids, since hundreds of entities can be involved.

Thanks,
Oleg

1 Like

@Oleg, in reference to my post above:

I am seeing that the Unlink Rule is having some trouble when 1000 entities are unlinked:

How can I determine what is necessary for this Rule to work smoothly when many entities are unlinked at once (via graphQL or a different Rule)?

Hi, @Matt_Blais

I am seeing that the Unlink Rule is having some trouble when 1000 entities are unlinked

We are working on tuning automation rules for large amounts of data. We will publish fixes in the near future and hope this will help avoid such issues.

Thanks,
Oleg

3 Likes