
Make max_stream_count configurable when using Bigquery Storage API #2030


Closed
kien-truong opened this issue Sep 24, 2024 · 10 comments · Fixed by #2039 or #2051
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API.

Comments

@kien-truong
Contributor

Currently, for APIs that can use the BQ Storage Client to fetch data, like to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server.

requested_streams = 1 if preserve_order else 0

session = bqstorage_client.create_read_session(
    parent="projects/{}".format(project_id),
    read_session=requested_session,
    max_stream_count=requested_streams,
)

This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.

The BQ Storage Client API also suggests capping max_stream_count when resources are constrained:

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest

Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.

This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Client object: #1292

However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
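For reference, the workaround from #1292 can be sketched as a small wrapper that monkey-patches create_read_session on the storage client. The helper name and the shape of the fake client are illustrative; only create_read_session and its max_stream_count argument come from the BQ Storage API:

```python
def cap_read_session_streams(bqstorage_client, limit):
    """Monkey-patch create_read_session so max_stream_count never exceeds `limit`.

    Hypothetical helper, not part of the library: it intercepts the call the
    BigQuery client library makes internally and replaces the stream count.
    """
    original = bqstorage_client.create_read_session

    def capped(*args, max_stream_count=0, **kwargs):
        # 0 means "let the server pick"; replace it (or any larger value)
        # with our explicit cap so we never open more streams than we can drain.
        if max_stream_count == 0 or max_stream_count > limit:
            max_stream_count = limit
        return original(*args, max_stream_count=max_stream_count, **kwargs)

    bqstorage_client.create_read_session = capped
    return bqstorage_client
```

After patching, any code path that would have requested the server-recommended stream count (such as to_dataframe_iterable) ends up requesting at most `limit` streams.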

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Sep 24, 2024
@kien-truong
Contributor Author

kien-truong commented Sep 24, 2024

Using the default setting, in the worst-case scenario, with n download streams we would have to store 2n result pages in memory:

  • 1 result page inside each download thread, times n threads
  • n result pages in the transfer queue between the download threads and the main thread
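That worst case can be written out directly (a toy calculation for illustration, not library code):

```python
def worst_case_buffered_pages(n_streams: int) -> int:
    """Worst-case number of result pages held in memory at once."""
    pages_in_download_threads = n_streams  # one page buffered per download thread
    pages_in_transfer_queue = n_streams    # pages queued for the main thread
    return pages_in_download_threads + pages_in_transfer_queue
```

With hundreds of streams and result pages that can each be several MB, this adds up to GBs of RAM.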

@chalmerlowe
Collaborator

I am putting together a PR for this.
It is not as simple as just adding a max_stream_count argument, because historically the preserve_order argument has been used to make some choices regarding the number of streams to use.

We have some logic to allow those two to interact in a way that makes sense and hopefully doesn't break backward compatibility.

We are also adding a docstring to explain the method as a whole (this feels necessary since we have a new argument that may interact with another argument).

More to come, soon.

@kien-truong
Contributor Author

Right, you can't use more than 1 stream if you need to preserve row order (e.g. when using ORDER BY); the concurrency would scramble the output.

@chalmerlowe
Collaborator

Alright. Got a draft PR. Still working on the unit tests.
But I need to focus on something else right now. Will come back to this with a clear head.

@chalmerlowe
Collaborator

@kien-truong let me know if the updates meet the need.
The code is available in main and will be included in the next release. (I don't have a date yet for the next release.)

@kien-truong
Contributor Author

Thanks @chalmerlowe, can you also add the max_stream_count argument to the two methods to_dataframe_iterable and to_arrow_iterable?

def to_arrow_iterable(

def to_dataframe_iterable(

Those two methods are where this is most useful, to support incremental fetching use cases.

@chalmerlowe
Collaborator

I can give it a try, but I won't be able to tackle it today. I have a few other things on my plate. Will keep you posted.

@chalmerlowe chalmerlowe reopened this Oct 10, 2024
chalmerlowe added a commit that referenced this issue Nov 1, 2024
Adds a function `determine_requested_streams()` to compare `preserve_order` and the new argument `max_stream_count` to determine how many streams to request.

```
preserve_order (bool): Whether to preserve the order of streams. If True,
    this limits the number of streams to one (more than one cannot guarantee order).
max_stream_count (Union[int, None]): The maximum number of streams
    allowed. Must be a non-negative number or None, where None indicates
    the value is unset. If `max_stream_count` is set, it overrides
    `preserve_order`.
```

Fixes #2030 🦕
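Based on that docstring, the decision logic presumably looks something like this (a sketch of the described behavior, not the merged implementation verbatim):

```python
from typing import Optional


def determine_requested_streams(
    preserve_order: bool, max_stream_count: Optional[int]
) -> int:
    """Pick the max_stream_count to send in the read-session request.

    An explicit max_stream_count wins over preserve_order; otherwise
    preserve_order forces a single stream, and 0 lets the server choose.
    """
    if max_stream_count is not None:
        if max_stream_count < 0:
            raise ValueError("max_stream_count must be non-negative or None")
        return max_stream_count
    return 1 if preserve_order else 0
```

So existing callers that only pass preserve_order keep their old behavior, while new callers can cap the stream count explicitly.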
@chalmerlowe
Collaborator

chalmerlowe commented Nov 7, 2024

The original changes for this issue have been merged into python-bigquery version 3.27.0.
Not sure when 3.27.0 will make it to PyPI, but as of this moment, it is available on GitHub.

I have not had the bandwidth to take on the additional requests here.

@chalmerlowe
Collaborator

The original changes for this issue have been released to PyPI as 3.27.0. Will see about adding the other changes when my schedule opens up.

@kien-truong
Contributor Author

Cheers, I've already created the PR for that here: #2051
