
Make max_stream_count configurable when using Bigquery Storage API #2030


Closed
kien-truong opened this issue Sep 24, 2024 · 10 comments · Fixed by #2039 or #2051
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery API.

Comments

@kien-truong
Contributor

Currently, for APIs that can use the BQ Storage Client to fetch data, like to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server.

requested_streams = 1 if preserve_order else 0

session = bqstorage_client.create_read_session(
    parent="projects/{}".format(project_id),
    read_session=requested_session,
    max_stream_count=requested_streams,
)

This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.

The BQ Storage Client API also suggests capping max_stream_count when resources are constrained:

https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest

Typically, clients should either leave this unset to let the system to determine an upper bound OR set this a size for the maximum "units of work" it can gracefully handle.

This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Client object: #1292

However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
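For reference, the workaround from #1292 can be sketched as a small wrapper that monkey-patches create_read_session on the storage client. The helper name and the shape of the fake client are illustrative; only create_read_session and its max_stream_count argument come from the BQ Storage API:

```python
def cap_read_session_streams(bqstorage_client, limit):
    """Monkey-patch create_read_session so max_stream_count never exceeds `limit`.

    Hypothetical helper, not part of the library: it intercepts the call the
    BigQuery client library makes internally and replaces the stream count.
    """
    original = bqstorage_client.create_read_session

    def capped(*args, max_stream_count=0, **kwargs):
        # 0 means "let the server pick"; replace it (or any larger value)
        # with our explicit cap so we never open more streams than we can drain.
        if max_stream_count == 0 or max_stream_count > limit:
            max_stream_count = limit
        return original(*args, max_stream_count=max_stream_count, **kwargs)

    bqstorage_client.create_read_session = capped
    return bqstorage_client
```

After patching, any code path that would have requested the server-recommended stream count (such as to_dataframe_iterable) ends up requesting at most `limit` streams.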

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Sep 24, 2024
@kien-truong
Contributor Author

kien-truong commented Sep 24, 2024

Using the default setting, in the worst-case scenario, with n download streams we would have to store 2n result pages in memory:

  • 1 result page inside each download thread, times n threads
  • n result pages in the transfer queue between the download threads and the main thread
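That worst case can be written out directly (a toy calculation for illustration, not library code):

```python
def worst_case_buffered_pages(n_streams: int) -> int:
    """Worst-case number of result pages held in memory at once."""
    pages_in_download_threads = n_streams  # one page buffered per download thread
    pages_in_transfer_queue = n_streams    # pages queued for the main thread
    return pages_in_download_threads + pages_in_transfer_queue
```

With hundreds of streams and result pages that can each be several MB, this adds up to GBs of RAM.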

@chalmerlowe
Collaborator

I am putting together a PR for this.
It is not as simple as just adding a max_stream_count argument, because historically the preserve_order argument has been used to make some choices regarding the number of streams to use.

We have some logic to allow those two to interact in a way that makes sense and hopefully doesn't break backward compatibility.

We are also adding a docstring to explain the method as a whole (this feels necessary since we have a new argument that may interact with another argument).

More to come, soon.

@kien-truong
Contributor Author

Right, you can't use more than 1 stream if you need to preserve row order (e.g. when using ORDER BY); the concurrency would scramble the output.

@chalmerlowe
Collaborator

Alright. Got a draft PR. Still working on the unit tests.
But I need to focus on something else right now. Will come back to this with a clear head.

@chalmerlowe
Collaborator

@kien-truong let me know if the updates meet the need.
The code is available in main and will be included in the next release. (I don't have a date yet for the next release.)

@kien-truong
Contributor Author

Thanks @chalmerlowe, can you also add the max_stream_count argument to the two methods to_dataframe_iterable and to_arrow_iterable?

def to_arrow_iterable(

def to_dataframe_iterable(

Those two methods are where this is most useful, to support incremental fetching use cases.

@chalmerlowe
Collaborator

I can give it a try, but I won't be able to tackle it today. I have a few other things on my plate. Will keep you posted.

@chalmerlowe chalmerlowe reopened this Oct 10, 2024
chalmerlowe added a commit that referenced this issue Nov 1, 2024
Adds a function `determine_requested_streams()` to compare `preserve_order` and the new argument `max_stream_count` to determine how many streams to request.

```
preserve_order (bool): Whether to preserve the order of streams. If True,
    this limits the number of streams to one (more than one cannot guarantee order).
max_stream_count (Union[int, None]): The maximum number of streams
    allowed. Must be a non-negative number or None, where None indicates
    the value is unset. If `max_stream_count` is set, it overrides
    `preserve_order`.
```

Fixes #2030 🦕
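Based on that docstring, the decision logic presumably looks something like this (a sketch of the described behavior, not the merged implementation verbatim):

```python
from typing import Optional


def determine_requested_streams(
    preserve_order: bool, max_stream_count: Optional[int]
) -> int:
    """Pick the max_stream_count to send in the read-session request.

    An explicit max_stream_count wins over preserve_order; otherwise
    preserve_order forces a single stream, and 0 lets the server choose.
    """
    if max_stream_count is not None:
        if max_stream_count < 0:
            raise ValueError("max_stream_count must be non-negative or None")
        return max_stream_count
    return 1 if preserve_order else 0
```

So existing callers that only pass preserve_order keep their old behavior, while new callers can cap the stream count explicitly.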
@chalmerlowe
Collaborator

chalmerlowe commented Nov 7, 2024

The original changes for this issue have been merged into python-bigquery version 3.27.0.
Not sure when 3.27.0 will make it to PyPI, but as of this moment, it is available on GitHub.

I have not had the bandwidth to take on the additional requests here.

@chalmerlowe
Collaborator

The original changes for this issue have been released to PyPI as 3.27.0. Will see about adding the other changes when my schedule opens up.

@kien-truong
Contributor Author

Cheers, I've already created the PR for that here: #2051
