Make max_stream_count configurable when using Bigquery Storage API #2030
Comments
Using the default setting, in the worst-case scenario, for n download streams, we would have to hold the buffered results of all n streams in memory at once.
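To make that worst case concrete, here is a back-of-envelope sketch; the stream count and per-stream buffer size are hypothetical illustration numbers, not measurements from this issue:

```python
# Hypothetical figures for illustration only.
n_streams = 200              # read streams opened for one large result set
mb_buffered_per_stream = 16  # data buffered per stream before the consumer drains it

worst_case_mb = n_streams * mb_buffered_per_stream
print(worst_case_mb)  # 3200 MB, i.e. ~3.2 GB held in memory at once
```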
I am putting together a PR for this. We have some logic to allow those two settings to interact in a way that makes sense and hopefully doesn't break backward compatibility. We are also adding a docstring to explain the method as a whole (this feels necessary since we have a new argument that may interact with another argument). More to come, soon.
Right, you can't use more than one stream if you need to preserve row order when using ORDER BY; the concurrency would scramble the output.
Alright. Got a draft PR. Still working on the unit tests.
@kien-truong let me know if the updates meet the need.
Thanks @chalmerlowe, can you also add it to these two methods?
python-bigquery/google/cloud/bigquery/table.py, line 1811 in 7372ad6
python-bigquery/google/cloud/bigquery/table.py, line 1976 in 7372ad6
Those two methods are where this is most useful, to support incremental fetching use cases.
I can give it a try, but I won't be able to tackle it today. I have a few other things on my plate. Will keep you posted.
Adds a function `determine_requested_streams()` to compare `preserve_order` and the new argument `max_stream_count` to determine how many streams to request.

```
preserve_order (bool): Whether to preserve the order of streams. If True, this
    limits the number of streams to one (more than one cannot guarantee order).
max_stream_count (Union[int, None]): The maximum number of streams allowed. Must
    be a non-negative number or None, where None indicates the value is unset. If
    `max_stream_count` is set, it overrides `preserve_order`.
```

Fixes #2030 🦕
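A minimal sketch of the logic that docstring describes (not necessarily the library's exact implementation; the return value 0 is assumed to mean "let the server decide"):

```python
from typing import Optional


def determine_requested_streams(
    preserve_order: bool,
    max_stream_count: Optional[int],
) -> int:
    """Return the number of streams to request (0 = server-recommended)."""
    if max_stream_count is not None:
        if max_stream_count < 0:
            raise ValueError("max_stream_count must be non-negative or None")
        # An explicit cap overrides preserve_order.
        return max_stream_count
    # Unset: a single stream preserves row order; otherwise defer to the server.
    return 1 if preserve_order else 0


print(determine_requested_streams(True, None))   # 1
print(determine_requested_streams(False, None))  # 0
print(determine_requested_streams(True, 4))      # 4
```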
The original changes for this issue have been merged into python-bigquery version 3.27.0. I have not had the bandwidth to take on the additional requests here.
The original changes have been released to PyPI as 3.27.0. Will see about adding the other changes when my schedule opens up.
Cheers, I've already created the PR for that here: #2051
Currently, for APIs that can use the BQ Storage Client to fetch data, like `to_dataframe_iterable` or `to_arrow_iterable`, the client library always uses the maximum number of read streams recommended by the BQ server:
python-bigquery/google/cloud/bigquery/_pandas_helpers.py, line 840 in ef8e927
python-bigquery/google/cloud/bigquery/_pandas_helpers.py, lines 854 to 858 in ef8e927
This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.
The BQ Storage API also suggests capping `max_stream_count` when resources are constrained: https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest
This problem has been encountered by others before and can be worked around by monkey-patching `create_read_session` on the BQ Client object: #1292. However, it should really be fixed by allowing the `max_stream_count` parameter to be set through a public API.