mashr
Simple data pipeline framework for GCP's BigQuery
# GCP Locations, Regions, and Zones Considerations
## GBQ: Google BigQuery, Dataset Regions and Multi-Regions
* https://cloud.google.com/bigquery/docs/locations
* The default location is US.
- https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets
### Considerations when choosing Regions
* **Colocate your BigQuery dataset and your external data source.**
* **Colocate your Cloud Storage buckets for loading data.**
- If your BigQuery dataset is in a multi-regional location, the Cloud Storage
bucket containing the data you're loading must be in a regional or
multi-regional bucket in the same location.
- For example, if your BigQuery dataset is in the EU, the Cloud Storage
bucket must be in a regional or multi-regional bucket in the EU.
- If your dataset is in a regional location, your Cloud Storage bucket must
be a regional bucket in the same location.
- For example, if your dataset is in the Tokyo region, your Cloud Storage
bucket must be a regional bucket in Tokyo.
- Exception: If your dataset is in the US multi-regional location, you can
load data from a Cloud Storage bucket in any regional or multi-regional
location.
* **When you query data in an external data source such as Cloud Storage, the
data you're querying must be in the same [multi-regional] location as your
BigQuery dataset.**
- For example, if your BigQuery dataset is in the EU multi-regional location,
the Cloud Storage bucket containing the data you're querying must be in a
multi-regional bucket in the EU. If your dataset is in the US
multi-regional location, your Cloud Storage bucket must be in a
multi-regional bucket in the US.
* **If your dataset is in a regional location, the Cloud Storage bucket
containing the data you're querying must be in a regional bucket in the same
location.**
- For example, if your dataset is in the Tokyo region, your Cloud Storage
bucket must be a regional bucket in Tokyo.
* If your external dataset is in Cloud Bigtable, your dataset must be in the US
or the EU multi-regional location. Your Cloud Bigtable data must be in one of
the supported Cloud Bigtable locations.
* Location considerations do not apply to Google Drive external data sources.
* **Colocate your Cloud Storage buckets for exporting data.**
- When you export data, the regional or multi-regional Cloud Storage bucket
must be in the same location as the BigQuery dataset. For example, if your
BigQuery dataset is in the EU multi-regional location, the Cloud Storage
bucket containing the data you're exporting must be in a regional or
multi-regional location in the EU.
- If your dataset is in a regional location, your Cloud Storage bucket must be
a regional bucket in the same location.
- For example, if your dataset is in the Tokyo region, your Cloud Storage
bucket must be a regional bucket in Tokyo.
- Exception: If your dataset is in the US multi-regional location, you can
export data into a Cloud Storage bucket in any regional or multi-regional
location.
* **Develop a data management plan.**
- If you choose a regional storage resource such as a BigQuery dataset or a
Cloud Storage bucket, develop a plan for geographically managing your data.
* When loading data, querying data, or exporting data, BigQuery determines the
location to run the job based on the datasets referenced in the request. This
can affect the price.
- For example, if a query references a table in a dataset stored in the
asia-northeast1 region, the query job will run in that region. If a query
does not reference any tables or other resources contained within datasets,
and no destination table is provided, the query job will run in the
location of the project's flat-rate reservation. If the project does not
have a flat-rate reservation, the job runs in the US region. If more than
one flat-rate reservation is associated with the project, the location of
the reservation with the largest number of slots is where the job runs.
* You can specify the location to run a job explicitly.
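
The job-location rules in the bullets above can be sketched as a small function. This is only an illustration of the rules as written here, not BigQuery's actual implementation; `resolve_job_location`, its parameters, and the `Reservation` type are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Reservation:
    """Hypothetical stand-in for a flat-rate reservation on a project."""
    location: str
    slots: int


def resolve_job_location(
    dataset_location: Optional[str],
    reservations: Sequence[Reservation] = (),
) -> str:
    """Return where a job would run, following the rules listed above.

    dataset_location is the location of the datasets referenced by the
    request, or None when the request references no datasets.
    """
    # A referenced dataset decides the location of the job.
    if dataset_location is not None:
        return dataset_location
    # No datasets referenced: use the reservation with the most slots.
    if reservations:
        return max(reservations, key=lambda r: r.slots).location
    # No reservation either: the job runs in the US.
    return "US"
```

For example, `resolve_job_location("asia-northeast1")` returns `"asia-northeast1"`, while `resolve_job_location(None)` falls back to `"US"`.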
### Background
* You specify a location for storing your BigQuery data when you create a
dataset. After you create the dataset, the location cannot be changed.
* There are two types of locations:
- A regional location is a specific geographic place, such as Tokyo. For more
information, see Regional resources on the Geography and Regions page.
- A multi-regional location is a large geographic area, such as the United
States, that contains at least two geographic places. For more information,
see Multi-regional resources on the Geography and Regions page.
* For more information on Cloud Storage locations, see Bucket Locations in the
Cloud Storage documentation.
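
As a rough sketch, the two location types can be combined with the load and export colocation rules from the Considerations list. The `Location` type, its `parent` field, and `can_load_or_export` are hypothetical names for illustration; the caller is assumed to know which multi-regional area contains a given region. The stricter rules for querying external data are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Hypothetical description of a dataset or bucket location."""
    name: str                     # e.g. "US", "EU", "asia-northeast1"
    multi_regional: bool          # True for a multi-regional location
    parent: Optional[str] = None  # multi-regional area containing a region


def can_load_or_export(dataset: Location, bucket: Location) -> bool:
    """Apply the load/export colocation rules to a dataset and a bucket."""
    if dataset.multi_regional:
        # Exception: a US multi-regional dataset works with any bucket.
        if dataset.name == "US":
            return True
        # Otherwise the bucket must be the same multi-regional location,
        # or a regional bucket inside that location.
        if bucket.multi_regional:
            return bucket.name == dataset.name
        return bucket.parent == dataset.name
    # Regional dataset: the bucket must be a regional bucket in the same place.
    return (not bucket.multi_regional) and bucket.name == dataset.name
```

With `Location("EU", True)` as the dataset, a bucket described as `Location("europe-west1", False, parent="EU")` passes the check, while a bucket in `asia-northeast1` does not.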
## GCS: Google Cloud Storage Regions
* Regions only, zones not available for GCS.
* https://cloud.google.com/storage/docs/locations
* The default bucket location is within the US. If you do not specify a
location constraint, then your bucket and data added to it are stored on
servers in the US.
### How the region affects other services accessing this service
* **Compute Engine VM notes**
- Storing data in the same region as your Compute Engine VM instances can
provide **better performance and lower network costs.** These advantages
apply to both regional and dual-regional locations.
- While you can't specify a Compute Engine zone as a bucket location, all
Compute Engine VM instances in zones within a certain regional location
have similar performance when accessing buckets in that regional location.
### Background
* A good location balances latency, availability, and bandwidth costs for data
consumers.
* Use a **regional** location to help optimize latency, availability, and
network bandwidth for data consumers grouped in the same region.
- Store frequently accessed data, such as data used for analytics, as
**Regional Storage.**
- Store data typically accessed less than once a month, such as archived data,
as **Nearline Storage.**
- Store data typically accessed less than once a year, such as backup and
disaster recovery data, as **Coldline Storage.**
* Use a **dual-regional** location when you want similar performance advantages
as regional locations but with added geo-redundancy.
* Use a **general multi-regional** location when you want to serve content to
data consumers that are outside of the Google network and distributed across
large geographic areas, or when you need your data to be geo-redundant.
- Store frequently accessed data as **Multi-Regional Storage.**
- Store data typically accessed less than once a month as **Nearline Storage.**
- Store data typically accessed less than once a year as **Coldline Storage.**
* If you're not sure which location type to use or have no scenario in mind,
use a regional location that is convenient or contains the majority of the
users of your data.
* The different location types:
- A regional location is a specific geographic place, such as London.
- A multi-regional location is a large geographic area, such as the United
States, that contains at least two geographic places.
- A dual-regional location (Beta) is a special type of multi-regional location
that consists of two specific regional locations.
* Objects stored in a multi-regional location are geo-redundant.
* Some storage classes can only be used in a certain type of location.
- You must store Regional Storage object data in a regional location, such as
us-east1.
- You must store Multi-Regional Storage object data in a multi-regional
location (which includes dual-regional locations) such as eu.
- You can store Nearline Storage and Coldline Storage object data in any
location.
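
The storage class guidance above can be condensed into a small helper. This is a sketch under stated assumptions: `choose_storage_class` is a hypothetical function, and the numeric thresholds (fewer than 12 accesses per year for "less than once a month", fewer than 1 for "less than once a year") are one reading of the wording above.

```python
def choose_storage_class(location_type: str, accesses_per_year: float) -> str:
    """Pick a storage class from the guidance above.

    location_type is "regional", "dual-regional", or "multi-regional".
    """
    if location_type not in ("regional", "dual-regional", "multi-regional"):
        raise ValueError(f"unknown location type: {location_type!r}")
    # Less than once a year: Coldline, which is allowed in any location.
    if accesses_per_year < 1:
        return "Coldline Storage"
    # Less than once a month: Nearline, which is allowed in any location.
    if accesses_per_year < 12:
        return "Nearline Storage"
    # Frequently accessed data: the class depends on the location type.
    if location_type == "regional":
        return "Regional Storage"
    # Dual-regional locations count as multi-regional for this purpose.
    return "Multi-Regional Storage"
```

For example, data read daily from a regional bucket maps to Regional Storage, while data read twice a year maps to Nearline Storage regardless of location type.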
## GCE: Google Compute Engine, Instance Regions and Zones
* https://cloud.google.com/compute/docs/regions-zones/#choosing_a_region_and_zone
* When you create a new project, Compute Engine automatically selects a default
region and zone for the project, based on the location from where the project
was created. Compute Engine attempts to pick a region and a zone that are
close to where the project originated so that resources you create have
reduced latency to your customers or clients. You can override the default
zone and region for a project if you want to create resources in a different
region or zone instead.
- https://cloud.google.com/compute/docs/regions-zones/changing-default-zone-region
### Considerations when choosing Regions and Zones
* Handling failures:
- Distribute your resources across multiple zones and regions to tolerate
outages.
- If a zone becomes unavailable, you can transfer traffic to another zone in
the same region to keep your services running. Similarly, if a region
experiences any disturbances, you should have backup services running in a
different region.
* Decreased network latency:
- To decrease network latency, you might want to choose a region or zone that
is close to your point of service. For example, if you mostly have
customers on the East Coast of the US, then you might want to choose a
primary region and zone that is close to that area and a backup region and
zone that is also close by.
* Communication within and across regions will incur different costs.
- Generally, communication within a region is cheaper and faster than
communication across different regions.
* Design important systems with redundancy across multiple zones.
- At some point in time, it is possible that your instances might experience
an unexpected failure. To mitigate the effects of these possible events,
you should duplicate important systems in multiple zones and regions.
### Background
* Google designs zones to be independent from each other: a zone usually has
power, cooling, networking, and control planes that are isolated from other
zones, and most single failure events will affect only a single zone. Thus,
if a zone becomes unavailable, you can transfer traffic to another zone in
the same region to keep your services running. Similarly, if a region
experiences any disturbances, you should have backup services running in a
different region. For more information about distributing your resources and
designing a robust system, see Designing Robust Systems.
* For example, by hosting instances in zones europe-west1-b and europe-west1-c,
if europe-west1-b fails unexpectedly, your instances in zone europe-west1-c
will still be available. However, if you host all your instances in
europe-west1-b, you will not be able to access any instances if
europe-west1-b goes offline. You should also consider hosting your resources
across regions. For example, consider hosting backup instances in a zone in
europe-west3 in the unlikely scenario that the europe-west1 region
experiences a failure. For more tips on how to design systems for
availability, see Designing Robust Systems.
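
The europe-west1 example above can be checked with a short sketch. It relies on the zone naming pattern visible in the example, where a zone name is the region name plus a letter suffix. The function names are hypothetical, and the specific zone `europe-west3-a` is only an illustrative choice for "a zone in europe-west3".

```python
from typing import Iterable, Set


def region_of(zone: str) -> str:
    """Return the region part of a zone name ('europe-west1-b' -> 'europe-west1')."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def zones_left_after_outage(zones: Iterable[str], failed: str) -> Set[str]:
    """Zones still available when `failed` (a zone or a region) goes offline."""
    return {z for z in zones if z != failed and region_of(z) != failed}


def survives_zone_failure(zones: Iterable[str]) -> bool:
    """True if instances are spread over at least two distinct zones."""
    return len(set(zones)) >= 2


def survives_region_failure(zones: Iterable[str]) -> bool:
    """True if instances are spread over at least two distinct regions."""
    return len({region_of(z) for z in zones}) >= 2
```

With instances in `europe-west1-b` and `europe-west1-c`, an outage of `europe-west1-b` leaves `europe-west1-c` available, but an outage of the whole `europe-west1` region leaves nothing unless a zone in another region, such as `europe-west3-a`, is also in use.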
## GCF: Google Cloud Functions Regions
* Regions only, zones not available for GCF.
* https://cloud.google.com/functions/docs/locations
### Considerations when choosing Regions
* Using services across multiple locations can affect your app's latency, as
well as pricing.
* Your primary considerations should be latency and availability.
* You can generally select the region closest to your Cloud Function's users.
* But you should also consider the location of the other GCP products and
services that your app uses.
* You can deploy functions to different regions within a project, but once the
region has been selected for a function it cannot be changed.
* Functions in a given region in a given project must have unique (case
insensitive) names, but functions across regions or across projects can share
the same name.
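
The naming rule in the last bullet can be sketched as a conflict check run before deployment. `find_name_conflicts` and the `(project, region, name)` tuple layout are hypothetical, and the project names in the usage note are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

FunctionRef = Tuple[str, str, str]  # (project, region, function name)


def find_name_conflicts(functions: Iterable[FunctionRef]) -> List[List[FunctionRef]]:
    """Group functions whose names collide within one project and region."""
    groups: Dict[FunctionRef, List[FunctionRef]] = defaultdict(list)
    for project, region, name in functions:
        # Names are compared case-insensitively within a project and region.
        groups[(project, region, name.lower())].append((project, region, name))
    # Only groups with more than one member are conflicts.
    return [refs for refs in groups.values() if len(refs) > 1]
```

Given `("proj-a", "us-central1", "Ingest")` and `("proj-a", "us-central1", "ingest")`, the two entries are reported as a conflict. The same name in another region or another project is not.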
### Background
* Cloud Functions is regional, which means the infrastructure that runs your
Cloud Function is located in a specific region and is managed by Google to be
redundantly available across all the zones within that region.
## Global, Regional, and Zonal Resources
* https://cloud.google.com/compute/docs/regions-zones/global-regional-zonal-resources