hdx.data.dataset
module hdx.data.dataset
Dataset class containing all logic for creating, checking, and updating datasets and associated resources.
Classes
-
Dataset — Dataset class enabling operations on datasets and associated resources.
class NotRequestableError()
Bases : HDXError
class Dataset(initial_data: dict | None = None, configuration: Configuration | None = None)
Bases : HDXObject
Dataset class enabling operations on datasets and associated resources.
Parameters
-
initial_data : dict | None — Initial dataset metadata dictionary. Defaults to None.
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
Methods
-
actions — Dictionary of actions that can be performed on object
-
separate_resources — Move contents of resources key in internal dictionary into self.resources
-
unseparate_resources — Move self.resources into resources key in internal dictionary
-
get_dataset_dict — Return the dataset dictionary with self.resources placed under the resources key
-
save_to_json — Save dataset to JSON. If follow_urls is True, resource urls that point to datasets are followed to retrieve final urls.
-
load_from_json — Load dataset from JSON
-
init_resources — Initialise self.resources list
-
add_update_resource — Add new or update existing resource in dataset with new metadata
-
add_update_resources — Add new resources to the dataset or update existing resources with new metadata
-
delete_resource — Delete a resource from the dataset and also from HDX by default
-
get_resources — Get dataset's resources
-
get_resource — Get one resource from dataset by index
-
number_of_resources — Get number of dataset's resources
-
move_resource — Move resource in dataset to be before the resource whose name starts with the value of insert_before.
-
update_from_yaml — Update dataset metadata with static metadata from YAML file
-
update_from_json — Update dataset metadata with static metadata from JSON file
-
read_from_hdx — Reads the dataset given by identifier from HDX and returns Dataset object
-
reorder_resources — Reorder resources in dataset according to provided list. Resources are updated in the dataset object to match new order. However, the dataset is not refreshed by rereading from HDX. If only some resource ids are supplied then these are assumed to be first and the other resources will stay in their original order.
-
check_resources_url_filetoupload — Check for error where both url and file to upload are provided for resources
-
check_resources_fields — Check that metadata for resources is complete. The parameter ignore_fields should be set if required to any fields that should be ignored for the particular operation.
-
check_required_fields — Check that metadata for dataset is complete. The parameter ignore_fields should be set if required to any fields that should be ignored for the particular operation.
-
revise — Revises an HDX dataset in HDX
-
update_in_hdx — Check if dataset exists in HDX and if so, update it. match_resources_by_metadata matches resources by id if ids are available; otherwise it matches by name alone if names are unique, or by name and format if they are not.
-
create_in_hdx — Check if dataset exists in HDX and if so, update it, otherwise create it. match_resources_by_metadata matches resources by id if ids are available; otherwise it matches by name alone if names are unique, or by name and format if they are not.
-
delete_from_hdx — Deletes a dataset from HDX.
-
search_in_hdx — Searches for datasets in HDX
-
get_all_dataset_names — Get all dataset names in HDX
-
get_all_datasets — Get all datasets from HDX (just calls search_in_hdx)
-
get_all_resources — Get all resources from a list of datasets (such as returned by search)
-
autocomplete — Autocomplete a dataset name and return matches
-
get_time_period — Get dataset date as datetimes and strings in specified format. If no format is supplied, the ISO 8601 format is used. Returns a dictionary containing keys startdate (start date as datetime), enddate (end date as datetime), startdate_str (start date as string), enddate_str (end date as string) and ongoing (whether the end date rolls forward every day).
-
set_time_period — Set time period from either datetime objects or strings. Any time and time zone information will be ignored by default (meaning that the time of the start date is set to 00:00:00, the time of any end date is set to 23:59:59 and the time zone is set to UTC). To have the time and time zone accounted for, set ignore_timeinfo to False. In this case, the time will be converted to UTC.
-
set_time_period_year_range — Set time period as a range from year or start and end year.
-
list_valid_update_frequencies — List of valid update frequency values
-
transform_update_frequency — Get numeric update frequency (as string since that is required field format) from textual representation or vice versa (eg. 'Every month' = '30', '30' or 30 = 'Every month')
-
get_expected_update_frequency — Get expected update frequency (in textual rather than numeric form)
-
set_expected_update_frequency — Set expected update frequency. You can pass frequencies like "Every week" or '7' or 7. Valid values for update frequency can be found from Dataset.list_valid_update_frequencies().
-
get_tags — Return the dataset's list of tags
-
add_tag — Add a tag
-
add_tags — Add a list of tags
-
clean_tags — Clean tags in an HDX object according to tags cleanup spreadsheet, deleting invalid tags that cannot be mapped
-
remove_tag — Remove a tag
-
is_subnational — Return if the dataset is subnational
-
set_subnational — Set if dataset is subnational or national
-
get_location_iso3s — Return the dataset's locations as iso3 codes
-
get_location_names — Return the dataset's location names
-
add_country_location — Add a country. If an iso 3 code is not provided, value is parsed and if it is a valid country name, converted to an iso 3 code. If the country is already added, it is ignored.
-
add_country_locations — Add a list of countries. If iso 3 codes are not provided, values are parsed and where they are valid country names, converted to iso 3 codes. If any country is already added, it is ignored.
-
add_region_location — Add all countries in a region. If a 3 digit UNStats M49 region code is not provided, value is parsed as a region name. If any country is already added, it is ignored.
-
add_other_location — Add a location which is not a country or region. Value is parsed and compared to existing locations in HDX. If the location is already added, it is ignored.
-
remove_location — Remove a location. If the location is not present, nothing is removed.
-
get_maintainer — Get the dataset's maintainer.
-
set_maintainer — Set the dataset's maintainer.
-
get_organization — Get the dataset's organization.
-
set_organization — Set the dataset's organization.
-
get_showcases — Get any showcases the dataset is in
-
add_showcase — Add dataset to showcase
-
add_showcases — Add dataset to multiple showcases
-
remove_showcase — Remove dataset from showcase
-
is_requestable — Return whether the dataset is requestable or not
-
set_requestable — Set the dataset to be of type requestable or not
-
get_fieldnames — Return list of fieldnames in your data. Only applicable to requestable datasets.
-
add_fieldname — Add a fieldname to list of fieldnames in your data. Only applicable to requestable datasets.
-
add_fieldnames — Add a list of fieldnames to list of fieldnames in your data. Only applicable to requestable datasets.
-
remove_fieldname — Remove a fieldname. Only applicable to requestable datasets.
-
get_filetypes — Return list of filetypes in your data
-
add_filetype — Add a filetype to list of filetypes in your data. Only applicable to requestable datasets.
-
add_filetypes — Add a list of filetypes to list of filetypes in your data. Only applicable to requestable datasets.
-
remove_filetype — Remove a filetype
-
set_custom_viz — Set custom visualization url for dataset
-
get_custom_viz — Get custom visualization url for dataset
-
preview_off — Set dataset preview off
-
preview_resource — Set dataset preview on for an unspecified resource
-
set_preview_resource — Set the resource that will be used for displaying previews in dataset preview
-
create_default_views — Create default resource views for all resources in dataset
-
get_name_or_id — Get dataset name or id eg. for use in urls. If prefer_name is True, name is preferred over id if available, otherwise id is preferred over name if available.
-
get_hdx_url — Get the url of the dataset on HDX or None if the dataset name and id fields are missing. If prefer_name is True, name is preferred over id if available, otherwise id is preferred over name if available.
-
get_api_url — Get the API url of the dataset on HDX
-
generate_resource — Write rows to file and create resource, adding it to the dataset. The headers argument is either a row number (rows start counting at 1), or the actual headers defined as a list of strings. If not set, all rows will be treated as containing values. Specific columns to include can be specified (ie. a subset of the headers).
-
download_generate_resource — Download url, write rows to csv and create resource, adding it to the dataset. The returned dictionary will contain the resource in the key resource, headers in the key headers and list of rows in the key rows.
-
add_hapi_error — Writes error messages that were uncovered while processing data for the HAPI database to a resource's metadata on HDX. If the resource already has an error message, it is only overwritten if the two messages are different.
staticmethod Dataset.actions() → dict[str, str]
Dictionary of actions that can be performed on object
Returns
-
dict[str, str] — Dictionary of actions that can be performed on object
method Dataset.separate_resources() → None
Move contents of resources key in internal dictionary into self.resources
Returns
-
None — None
method Dataset.unseparate_resources() → None
Move self.resources into resources key in internal dictionary
Returns
-
None — None
method Dataset.get_dataset_dict() → dict
Return the dataset dictionary with self.resources placed under the resources key
Returns
-
dict — Dataset dictionary
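The relationship between separate_resources, unseparate_resources and get_dataset_dict can be pictured with plain dictionaries. The following is a minimal sketch only: the helper names `separate` and `unseparate` are hypothetical and this is not the library's implementation.

```python
from copy import deepcopy


def separate(data: dict) -> list:
    """Pop the 'resources' key out of the metadata dict and return it as a list."""
    return data.pop("resources", [])


def unseparate(data: dict, resources: list) -> dict:
    """Return a copy of the metadata with the resources back under 'resources'."""
    combined = deepcopy(data)
    if resources:
        combined["resources"] = deepcopy(resources)
    return combined


# Illustrative metadata, not a real HDX dataset
metadata = {"name": "my-dataset", "resources": [{"name": "a.csv", "format": "csv"}]}
resources = separate(metadata)                  # metadata no longer holds 'resources'
dataset_dict = unseparate(metadata, resources)  # combined view, metadata untouched
```

Keeping the resources in a separate list lets them be handled as objects in their own right, while the combined dictionary is what gets serialised.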
method Dataset.save_to_json(path: Path | str, follow_urls: bool = False, session: Session | None = None) → None
Save dataset to JSON. If follow_urls is True, resource urls that point to datasets are followed to retrieve final urls.
Parameters
-
path : Path | str — Path to save dataset
-
follow_urls : bool — Whether to follow urls. Defaults to False.
-
session : Session | None
Returns
-
None — None
staticmethod Dataset.load_from_json(path: Path | str) → Optional['Dataset']
Load dataset from JSON
Parameters
-
path : Path | str — Path to load dataset
Returns
-
Optional['Dataset'] — Dataset created from JSON or None
method Dataset.init_resources() → None
Initialise self.resources list
Returns
-
None — None
method Dataset.add_update_resource(resource: Union['Resource', dict, str], ignore_datasetid: bool = False) → Resource
Add new or update existing resource in dataset with new metadata
Parameters
-
resource : Union['Resource', dict, str] — Either resource id or resource metadata from a Resource object or a dictionary
-
ignore_datasetid : bool — Whether to ignore dataset id in the resource. Defaults to False.
Returns
-
Resource — The resource that was added after matching with any existing resource
Raises
-
HDXError
method Dataset.add_update_resources(resources: Sequence[Union['Resource', dict, str]], ignore_datasetid: bool = False) → None
Add new resources to the dataset or update existing resources with new metadata
Parameters
-
resources : Sequence[Union['Resource', dict, str]] — A list of either resource ids or resources metadata from either Resource objects or dictionaries
-
ignore_datasetid : bool — Whether to ignore dataset id in the resource. Defaults to False.
Returns
-
None — None
Raises
-
HDXError
method Dataset.delete_resource(resource: Union['Resource', dict, str], delete: bool = True) → bool
Delete a resource from the dataset and also from HDX by default
Parameters
-
resource : Union['Resource', dict, str] — Either resource id or resource metadata from a Resource object or a dictionary
-
delete : bool — Whether to delete the resource from HDX (not just the dataset). Defaults to True.
Returns
-
bool — True if resource removed or False if not
Raises
-
HDXError
method Dataset.get_resources() → list['Resource']
Get dataset's resources
Returns
-
list['Resource'] — List of Resource objects
method Dataset.get_resource(index: int = 0) → Resource
Get one resource from dataset by index
Parameters
-
index : int — Index of resource in dataset. Defaults to 0.
Returns
-
Resource — Resource object
method Dataset.number_of_resources() → int
Get number of dataset's resources
Returns
-
int — Number of Resource objects
method Dataset.move_resource(resource_name: str, insert_before: str) → Resource
Move resource in dataset to be before the resource whose name starts with the value of insert_before.
Parameters
-
resource_name : str — Name of resource to move
-
insert_before : str — Resource to insert before
Returns
-
Resource — The resource that was moved
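The "starts with" rule used by move_resource can be sketched on a list of resource names. This is an illustration under assumptions: the function `move_before` is hypothetical, and raising `ValueError` when nothing matches is a choice made for this sketch rather than documented behaviour.

```python
def move_before(names: list[str], resource_name: str, insert_before: str) -> list[str]:
    """Return names with resource_name placed before the first name starting with insert_before."""
    remaining = [name for name in names if name != resource_name]
    if len(remaining) == len(names):
        # Assumption for this sketch: a missing resource is an error
        raise ValueError(f"Resource {resource_name!r} not found")
    for index, name in enumerate(remaining):
        if name.startswith(insert_before):
            return remaining[:index] + [resource_name] + remaining[index:]
    # Assumption for this sketch: no matching prefix is an error
    raise ValueError(f"No resource name starts with {insert_before!r}")


moved = move_before(["a.csv", "b.csv", "c.xlsx"], "c.xlsx", "b")
```

Here `moved` places `c.xlsx` directly before `b.csv`, because `b.csv` is the first name that starts with `b`.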
method Dataset.update_from_yaml(path: Path | str = Path('config', 'hdx_dataset_static.yaml')) → None
Update dataset metadata with static metadata from YAML file
Parameters
-
path : Path | str — Path to YAML dataset metadata. Defaults to config/hdx_dataset_static.yaml.
Returns
-
None — None
method Dataset.update_from_json(path: Path | str = Path('config', 'hdx_dataset_static.json')) → None
Update dataset metadata with static metadata from JSON file
Parameters
-
path : Path | str — Path to JSON dataset metadata. Defaults to config/hdx_dataset_static.json.
Returns
-
None — None
staticmethod Dataset.read_from_hdx(identifier: str, configuration: Configuration | None = None) → Optional['Dataset']
Reads the dataset given by identifier from HDX and returns Dataset object
Parameters
-
identifier : str — Identifier of dataset
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
Returns
-
Optional['Dataset'] — Dataset object if successful read, None if not
method Dataset.reorder_resources(resource_ids: Sequence[str]) → None
Reorder resources in dataset according to provided list. Resources are updated in the dataset object to match new order. However, the dataset is not refreshed by rereading from HDX. If only some resource ids are supplied then these are assumed to be first and the other resources will stay in their original order.
Parameters
-
resource_ids : Sequence[str] — List of resource ids
Returns
-
None — None
Raises
-
HDXError
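The ordering rule described for reorder_resources (supplied ids first, the rest in their original order) can be sketched as follows. The function name `reorder_ids` is hypothetical and the sketch works on ids only; it does not contact HDX.

```python
from typing import Sequence


def reorder_ids(current_ids: Sequence[str], resource_ids: Sequence[str]) -> list[str]:
    """Place the supplied ids first, then the remaining ids in their original order."""
    supplied = set(resource_ids)
    remaining = [rid for rid in current_ids if rid not in supplied]
    return list(resource_ids) + remaining


# Only two of the four ids are supplied, so they lead and the others keep their order
new_order = reorder_ids(["r1", "r2", "r3", "r4"], ["r3", "r1"])
```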
method Dataset.check_resources_url_filetoupload() → None
Check for error where both url and file to upload are provided for resources
Returns
-
None — None
method Dataset.check_resources_fields(ignore_fields: Sequence[str] = ()) → None
Check that metadata for resources is complete. The parameter ignore_fields should be set if required to any fields that should be ignored for the particular operation.
Parameters
-
ignore_fields : Sequence[str] — Fields to ignore. Default is ().
Returns
-
None — None
method Dataset.check_required_fields(ignore_fields: Sequence[str] = (), allow_no_resources: bool = False, **kwargs: Any) → None
Check that metadata for dataset is complete. The parameter ignore_fields should be set if required to any fields that should be ignored for the particular operation.
Parameters
-
ignore_fields : Sequence[str] — Fields to ignore. Default is ().
-
allow_no_resources : bool — Whether to allow no resources. Defaults to False.
Returns
-
None — None
Raises
-
HDXError
staticmethod Dataset.revise(match: dict[str, Any], filter: Sequence[str] = (), update: dict[str, Any] = {}, files_to_upload: dict[str, str] = {}, configuration: Configuration | None = None, **kwargs: Any) → Dataset
Revises an HDX dataset in HDX
Parameters
-
match : dict[str, Any] — Metadata on which to match dataset
-
filter : Sequence[str] — Filters to apply. Defaults to tuple().
-
update : dict[str, Any] — Metadata updates to apply. Defaults to {}.
-
files_to_upload : dict[str, str] — Files to upload to HDX. Defaults to {}.
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
-
**kwargs : Any — Additional arguments to pass to package_revise
Returns
-
Dataset — Dataset object
method Dataset.update_in_hdx(allow_no_resources: bool = False, update_resources: bool = True, match_resources_by_metadata: bool = True, keys_to_delete: Sequence[str] = (), remove_additional_resources: bool = False, match_resource_order: bool = False, create_default_views: bool = True, **kwargs: Any) → dict
Check if dataset exists in HDX and if so, update it. match_resources_by_metadata matches resources by id if ids are available; otherwise it matches by name alone if names are unique, or by name and format if they are not.
Returns a dictionary with key resource name and value status code:
-
0 = no file to upload and last_modified set to now (resource creation or data_updated flag is True)
-
1 = no file to upload and data_updated flag is False
-
2 = file uploaded to filestore (resource creation or either hash or size of file has changed)
-
3 = file not uploaded to filestore (hash and size of file are the same)
-
4 = file not uploaded (hash, size unchanged), given last_modified ignored
Parameters
-
allow_no_resources : bool — Whether to allow no resources. Defaults to False.
-
update_resources : bool — Whether to update resources. Defaults to True.
-
match_resources_by_metadata : bool — Compare resource metadata rather than position in list. Defaults to True.
-
keys_to_delete : Sequence[str] — List of top level metadata keys to delete. Defaults to tuple().
-
remove_additional_resources : bool — Remove additional resources found in dataset. Defaults to False.
-
match_resource_order : bool — Match order of given resources by name. Defaults to False.
-
create_default_views : bool — Whether to call package_create_default_resource_views. Defaults to True.
-
**kwargs : Any — See below
-
keep_crisis_tags : bool — Whether to keep existing crisis tags. Defaults to True.
-
updated_by_script : str — String to identify your script. Defaults to your user agent.
-
batch : str — A string you can specify to show which datasets are part of a single batch update
-
force_update : bool — Forces files to be updated even if they haven't changed
Returns
-
dict — Status codes of resources
Raises
-
HDXError
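The dictionary returned by update_in_hdx maps resource names to the status codes listed in the description. A small sketch of how a caller might interpret it follows; the `example` dictionary is made up for illustration and `summarise_statuses` is a hypothetical helper, not part of the library.

```python
# Meanings paraphrased from the status code list in the description
STATUS_MEANINGS = {
    0: "no file to upload, last_modified set to now",
    1: "no file to upload, data_updated flag is False",
    2: "file uploaded to filestore",
    3: "file not uploaded to filestore (hash and size unchanged)",
    4: "file not uploaded (hash, size unchanged), given last_modified ignored",
}


def summarise_statuses(statuses: dict[str, int]) -> dict[str, list[str]]:
    """Group resource names by the meaning of their status code."""
    summary: dict[str, list[str]] = {}
    for name, code in statuses.items():
        meaning = STATUS_MEANINGS.get(code, f"unknown code {code}")
        summary.setdefault(meaning, []).append(name)
    return summary


# Made-up example of a returned dictionary
example = {"admin1.csv": 2, "admin2.csv": 3, "readme": 1}
summary = summarise_statuses(example)
uploaded = [name for name, code in example.items() if code == 2]
```

Filtering on code 2 is a simple way to find which files were actually sent to the filestore in a run.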
method Dataset.create_in_hdx(allow_no_resources: bool = False, update_resources: bool = True, match_resources_by_metadata: bool = True, keys_to_delete: Sequence[str] = (), remove_additional_resources: bool = False, match_resource_order: bool = False, create_default_views: bool = True, **kwargs: Any) → dict
Check if dataset exists in HDX and if so, update it, otherwise create it. match_resources_by_metadata matches resources by id if ids are available; otherwise it matches by name alone if names are unique, or by name and format if they are not.
Returns a dictionary with key resource name and value status code:
-
0 = no file to upload and last_modified set to now (resource creation or data_updated flag is True)
-
1 = no file to upload and data_updated flag is False
-
2 = file uploaded to filestore (resource creation or either hash or size of file has changed)
-
3 = file not uploaded to filestore (hash and size of file are the same)
-
4 = file not uploaded (hash, size unchanged), given last_modified ignored
Parameters
-
allow_no_resources : bool — Whether to allow no resources. Defaults to False.
-
update_resources : bool — Whether to update resources (if updating). Defaults to True.
-
match_resources_by_metadata : bool — Compare resource metadata rather than position in list. Defaults to True.
-
keys_to_delete : Sequence[str] — List of top level metadata keys to delete. Defaults to tuple().
-
remove_additional_resources : bool — Remove additional resources found in dataset (if updating). Defaults to False.
-
match_resource_order : bool — Match order of given resources by name. Defaults to False.
-
create_default_views : bool — Whether to call package_create_default_resource_views (if updating). Defaults to True.
-
**kwargs : Any — See below
-
keep_crisis_tags : bool — Whether to keep existing crisis tags. Defaults to True.
-
updated_by_script : str — String to identify your script. Defaults to your user agent.
-
batch : str — A string you can specify to show which datasets are part of a single batch update
-
force_update : bool — Forces files to be updated even if they haven't changed
Returns
-
dict — Status codes of resources
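One way to read the match_resources_by_metadata rule is as a choice of matching key per resource. The sketch below is an interpretation of that sentence, not the library's code: `match_keys` is a hypothetical helper, and treating "names are unique" as uniqueness across the whole list is an assumption.

```python
def match_keys(resources: list[dict]) -> list[tuple]:
    """Pick a matching key for each resource: id, else name, else name plus format."""
    names = [resource.get("name") for resource in resources]
    # Assumption: uniqueness is judged across all resources in the list
    names_unique = len(set(names)) == len(names)
    keys = []
    for resource in resources:
        if resource.get("id"):
            keys.append(("id", resource["id"]))
        elif names_unique:
            keys.append(("name", resource["name"]))
        else:
            keys.append(("name+format", resource["name"], resource.get("format", "").lower()))
    return keys


# Made-up resources sharing one name, so format is needed to tell them apart
keys = match_keys(
    [
        {"id": "abc", "name": "data", "format": "CSV"},
        {"name": "data", "format": "CSV"},
        {"name": "data", "format": "XLSX"},
    ]
)
```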
method Dataset.delete_from_hdx() → None
Deletes a dataset from HDX.
Returns
-
None — None
classmethod Dataset.search_in_hdx(query: str | None = '*:*', configuration: Configuration | None = None, page_size: int = 1000, **kwargs: Any) → list['Dataset']
Searches for datasets in HDX
Parameters
-
query : str | None — Query (in Solr format). Defaults to '*:*'.
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
-
page_size : int — Size of page to use internally to query HDX. Defaults to 1000.
-
**kwargs : Any — See below
-
fq : string — Any filter queries to apply
-
rows : int — Number of matching rows to return. Defaults to all datasets (sys.maxsize).
-
start : int — Offset in the complete result for where the set of returned datasets should begin
-
sort : string — Sorting of results. Defaults to 'relevance asc, metadata_modified desc' if rows<=page_size or 'metadata_modified asc' if rows>page_size.
-
facet : string — Whether to enable faceted results. Defaults to True.
-
facet.mincount : int — Minimum count a facet value must have to be included in the results
-
facet.limit : int — Maximum number of values the facet fields return (a negative value = unlimited). Defaults to 50.
-
facet.field : list[str] — Fields to facet upon. Default is empty.
-
use_default_schema : bool — Use default package schema instead of custom schema. Defaults to False.
Returns
-
list['Dataset'] — list of datasets resulting from query
Raises
-
HDXError
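The interplay of rows, start and page_size, together with the documented default for sort, can be sketched without calling HDX. The helpers `default_sort` and `page_requests` are hypothetical; how the library actually splits a search into pages is an assumption here.

```python
import sys
from typing import Iterator


def default_sort(rows: int, page_size: int) -> str:
    """Default sort as documented for the sort keyword argument."""
    if rows <= page_size:
        return "relevance asc, metadata_modified desc"
    return "metadata_modified asc"


def page_requests(rows: int, page_size: int = 1000, start: int = 0) -> Iterator[tuple[int, int]]:
    """Yield (start, rows) pairs covering `rows` results in pages of at most `page_size`."""
    remaining = rows
    offset = start
    while remaining > 0:
        size = min(page_size, remaining)
        yield offset, size
        offset += size
        remaining -= size


pages = list(page_requests(rows=2500, page_size=1000))
small_sort = default_sort(rows=10, page_size=1000)
all_sort = default_sort(rows=sys.maxsize, page_size=1000)
```

A stable sort such as 'metadata_modified asc' matters once a search spans several pages, since relevance ordering could shift between requests.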
staticmethod Dataset.get_all_dataset_names(configuration: Configuration | None = None, **kwargs: Any) → list[str]
Get all dataset names in HDX
Parameters
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
-
**kwargs : Any — See below
-
rows : int — Number of rows to return. Defaults to all datasets (sys.maxsize)
-
start : int — Offset in the complete result for where the set of returned dataset names should begin
Returns
-
list[str] — list of all dataset names in HDX
classmethod Dataset.get_all_datasets(configuration: Configuration | None = None, page_size: int = 1000, **kwargs: Any) → list['Dataset']
Get all datasets from HDX (just calls search_in_hdx)
Parameters
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
-
page_size : int — Size of page to use internally to query HDX. Defaults to 1000.
-
**kwargs : Any — See below
-
fq : string — Any filter queries to apply
-
rows : int — Number of matching rows to return. Defaults to all datasets (sys.maxsize).
-
start : int — Offset in the complete result for where the set of returned datasets should begin
-
sort : string — Sorting of results. Defaults to 'metadata_modified asc'.
-
facet : string — Whether to enable faceted results. Defaults to True.
-
facet.mincount : int — Minimum count a facet value must have to be included in the results
-
facet.limit : int — Maximum number of values the facet fields return (a negative value = unlimited). Defaults to 50.
-
facet.field : list[str] — Fields to facet upon. Default is empty.
-
use_default_schema : bool — Use default package schema instead of custom schema. Defaults to False.
Returns
-
list['Dataset'] — list of datasets resulting from query
staticmethod Dataset.get_all_resources(datasets: Sequence['Dataset']) → list['Resource']
Get all resources from a list of datasets (such as returned by search)
Parameters
-
datasets : Sequence['Dataset'] — list of datasets
Returns
-
list['Resource'] — list of resources within those datasets
classmethod Dataset.autocomplete(name: str, limit: int = 20, configuration: Configuration | None = None) → list
Autocomplete a dataset name and return matches
Parameters
-
name : str — Name to autocomplete
-
limit : int — Maximum number of matches to return. Defaults to 20.
-
configuration : Configuration | None — HDX configuration. Defaults to global configuration.
Returns
-
list — Autocomplete matches
method Dataset.get_time_period(date_format: str | None = None, today: datetime = now_utc()) → dict
Get dataset date as datetimes and strings in specified format. If no format is supplied, the ISO 8601 format is used. Returns a dictionary containing keys startdate (start date as datetime), enddate (end date as datetime), startdate_str (start date as string), enddate_str (end date as string) and ongoing (whether the end date rolls forward every day).
Parameters
-
date_format : str | None — Date format. None is taken to be ISO 8601. Defaults to None.
-
today : datetime — Date to use for today. Defaults to now_utc().
Returns
-
dict — Dictionary of date information
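The shape of the dictionary described for get_time_period can be sketched with the standard library. `time_period_dict` is a hypothetical stand-in, and using `today` as the end date of an ongoing period is an assumption drawn from the parameter descriptions.

```python
from datetime import datetime, timezone
from typing import Optional


def time_period_dict(
    startdate: datetime,
    enddate: datetime,
    ongoing: bool = False,
    date_format: Optional[str] = None,
    today: Optional[datetime] = None,
) -> dict:
    """Build a dictionary with the keys described for get_time_period."""
    if ongoing:
        # Assumption: an ongoing period ends on "today"
        enddate = today or datetime.now(timezone.utc)

    def as_string(value: datetime) -> str:
        # A date_format of None is taken to mean ISO 8601
        return value.isoformat() if date_format is None else value.strftime(date_format)

    return {
        "startdate": startdate,
        "enddate": enddate,
        "startdate_str": as_string(startdate),
        "enddate_str": as_string(enddate),
        "ongoing": ongoing,
    }


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
period = time_period_dict(start, end, date_format="%Y-%m-%d")
iso_period = time_period_dict(start, end)
```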
method Dataset.set_time_period(startdate: datetime | str, enddate: datetime | str | None = None, ongoing: bool = False, ignore_timeinfo: bool = True) → None
Set time period from either datetime objects or strings. Any time and time zone information will be ignored by default (meaning that the time of the start date is set to 00:00:00, the time of any end date is set to 23:59:59 and the time zone is set to UTC). To have the time and time zone accounted for, set ignore_timeinfo to False. In this case, the time will be converted to UTC.
Parameters
-
startdate : datetime | str — Dataset start date
-
enddate : datetime | str | None — Dataset end date. Defaults to None.
-
ongoing : bool — True if ongoing, False if not. Defaults to False.
-
ignore_timeinfo : bool — Ignore time and time zone of date. Defaults to True.
Returns
-
None — None
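The effect of ignore_timeinfo can be illustrated with plain datetime objects. This is a sketch of the described behaviour only; `normalise_period` is a hypothetical helper and expects timezone-aware inputs.

```python
from datetime import datetime, timedelta, timezone


def normalise_period(
    startdate: datetime, enddate: datetime, ignore_timeinfo: bool = True
) -> tuple[datetime, datetime]:
    """Return (start, end) in UTC following the ignore_timeinfo rule."""
    if ignore_timeinfo:
        # Keep only the calendar date: start at 00:00:00, end at 23:59:59, both UTC
        start = datetime(startdate.year, startdate.month, startdate.day, tzinfo=timezone.utc)
        end = datetime(enddate.year, enddate.month, enddate.day, 23, 59, 59, tzinfo=timezone.utc)
    else:
        # Keep the time of day and convert it to UTC
        start = startdate.astimezone(timezone.utc)
        end = enddate.astimezone(timezone.utc)
    return start, end


plus_five = timezone(timedelta(hours=5))
given_start = datetime(2024, 3, 1, 2, 30, tzinfo=plus_five)
given_end = datetime(2024, 3, 31, 18, 0, tzinfo=plus_five)

ignored = normalise_period(given_start, given_end)
kept = normalise_period(given_start, given_end, ignore_timeinfo=False)
```

With the time kept, the start moves to the previous calendar day in UTC (29 February 2024, 21:30), which is why the default discards time information.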
method Dataset.set_time_period_year_range(dataset_year: str | int | Iterable, dataset_end_year: str | int | None = None) → list[int]
Set time period as a range from year or start and end year.
Parameters
-
dataset_year : str | int | Iterable — Dataset year given as string or int or range in an iterable
-
dataset_end_year : str | int | None — Dataset end year given as string or int. Defaults to None.
Returns
-
list[int] — The start and end year if supplied or sorted list of years
classmethod Dataset.list_valid_update_frequencies() → list[str]
List of valid update frequency values
Returns
-
list[str] — Allowed update frequencies
classmethod Dataset.transform_update_frequency(frequency: str | int) → str | None
Get numeric update frequency (as string since that is required field format) from textual representation or vice versa (eg. 'Every month' = '30', '30' or 30 = 'Every month')
Parameters
-
frequency : str | int — Update frequency in one format
Returns
-
str | None — Update frequency in alternative format or None if not valid
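The two-way conversion can be sketched with a pair of lookup tables. Only the two pairs mentioned in this reference ('Every month' = '30' and 'Every week' = '7') are included; the library's real table is longer, and `transform` is a hypothetical stand-in for the classmethod.

```python
from typing import Optional, Union

# Only the pairs named in this reference; the real list of frequencies is longer
TEXT_TO_NUMERIC = {"Every month": "30", "Every week": "7"}
NUMERIC_TO_TEXT = {numeric: text for text, numeric in TEXT_TO_NUMERIC.items()}


def transform(frequency: Union[str, int]) -> Optional[str]:
    """Convert text to numeric string or numeric (str or int) to text; None if not valid."""
    if isinstance(frequency, int):
        return NUMERIC_TO_TEXT.get(str(frequency))
    if frequency in TEXT_TO_NUMERIC:
        return TEXT_TO_NUMERIC[frequency]
    return NUMERIC_TO_TEXT.get(frequency)


as_numeric = transform("Every month")
from_string = transform("30")
from_int = transform(7)
invalid = transform("sometimes")
```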
method Dataset.get_expected_update_frequency() → str | None
Get expected update frequency (in textual rather than numeric form)
Returns
-
str | None — Update frequency in textual form or None if the update frequency doesn't exist or is blank.
method Dataset.set_expected_update_frequency(update_frequency: str | int) → None
Set expected update frequency. You can pass frequencies like "Every week" or '7' or 7. Valid values for update frequency can be found from Dataset.list_valid_update_frequencies().
Parameters
-
update_frequency : str | int — Update frequency
Returns
-
None — None
Raises
-
HDXError
method Dataset.get_tags() → list[str]
Return the dataset's list of tags
Returns
-
list[str] — list of tags or [] if there are none
method Dataset.add_tag(tag: str, log_deleted: bool = True) → tuple[list[str], list[str]]
Add a tag
Parameters
-
tag : str — Tag to add
-
log_deleted : bool — Whether to log informational messages about deleted tags. Defaults to True.
Returns
-
tuple[list[str], list[str]] — Tuple containing list of added tags and list of deleted tags and tags not added
method Dataset.add_tags(tags: Sequence[str], log_deleted: bool = True) → tuple[list[str], list[str]]
Add a list of tags
Parameters
-
tags : Sequence[str] — List of tags to add
-
log_deleted : bool — Whether to log informational messages about deleted tags. Defaults to True.
Returns
-
tuple[list[str], list[str]] — Tuple containing list of added tags and list of deleted tags and tags not added
method Dataset.clean_tags(log_deleted: bool = True) → tuple[list[str], list[str]]
Clean tags in an HDX object according to tags cleanup spreadsheet, deleting invalid tags that cannot be mapped
Parameters
-
log_deleted : bool — Whether to log informational messages about deleted tags. Defaults to True.
Returns
-
tuple[list[str], list[str]] — Tuple containing list of mapped tags and list of deleted tags and tags not added
method Dataset.remove_tag(tag: str) → bool
Remove a tag
Parameters
-
tag : str — Tag to remove
Returns
-
bool — True if tag removed or False if not
method Dataset.is_subnational() → bool
Return if the dataset is subnational
Returns
-
bool — True if the dataset is subnational, False if not
method Dataset.set_subnational(subnational: bool) → None
Set if dataset is subnational or national
Parameters
-
subnational : bool — True for subnational, False for national
Returns
-
None — None
method Dataset.get_location_iso3s(locations: Sequence[str] | None = None) → list[str]
Return the dataset's locations as iso3 codes
Parameters
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
Returns
-
list[str] — list of location iso3s
method Dataset.get_location_names(locations: Sequence[str] | None = None) → list[str]
Return the dataset's location names
Parameters
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
Returns
-
list[str] — list of location names
method Dataset.add_country_location(country: str, exact: bool = True, locations: Sequence[str] | None = None, use_live: bool = True) → bool
Add a country. If an iso 3 code is not provided, value is parsed and if it is a valid country name, converted to an iso 3 code. If the country is already added, it is ignored.
Parameters
-
country : str — Country to add
-
exact : bool — True for exact matching or False to allow fuzzy matching. Defaults to True.
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
-
use_live : bool — Try to use the latest country data from the web rather than the file in the package. Defaults to True.
Returns
-
bool — True if country added or False if country already present
Raises
-
HDXError
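The add-or-ignore behaviour described for add_country_location can be sketched with a toy lookup table. Everything here is illustrative: `NAME_TO_ISO3` is a made-up two-entry table, `add_country` is a hypothetical helper, and `ValueError` stands in for the library's HDXError.

```python
# Made-up two-entry table; the library resolves names against full country data
NAME_TO_ISO3 = {"kenya": "KEN", "mali": "MLI"}


def add_country(location_codes: list[str], country: str) -> bool:
    """Add a country by iso3 code or name; return False if it is already present."""
    candidate = country.strip()
    if candidate.upper() in NAME_TO_ISO3.values():
        iso3 = candidate.upper()  # already an iso3 code
    else:
        iso3 = NAME_TO_ISO3.get(candidate.lower())  # try to parse as a country name
    if iso3 is None:
        # Stand-in for the HDXError raised by the library
        raise ValueError(f"Cannot find an iso3 code for {country!r}")
    if iso3 in location_codes:
        return False
    location_codes.append(iso3)
    return True


codes: list[str] = []
first = add_country(codes, "Kenya")  # added by name
second = add_country(codes, "KEN")   # same country by code, so ignored
```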
method Dataset.add_country_locations(countries: Sequence[str], locations: Sequence[str] | None = None, use_live: bool = True) → bool
Add a list of countries. If iso 3 codes are not provided, values are parsed and where they are valid country names, converted to iso 3 codes. If any country is already added, it is ignored.
Parameters
-
countries : Sequence[str] — List of countries to add
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
-
use_live : bool — Try to use the latest country data from the web rather than the file in the package. Defaults to True.
Returns
-
bool — True if all countries added or False if any already present.
method Dataset.add_region_location(region: str, locations: Sequence[str] | None = None, use_live: bool = True) → bool
Add all countries in a region. If a 3 digit UNStats M49 region code is not provided, value is parsed as a region name. If any country is already added, it is ignored.
Parameters
-
region : str — M49 region, intermediate region or subregion to add
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
-
use_live : bool — Try to use the latest country data from the web rather than the file in the package. Defaults to True.
Returns
-
bool — True if all countries in region added or False if any already present.
method Dataset.add_other_location(location: str, exact: bool = True, alterror: str | None = None, locations: Sequence[str] | None = None) → bool
Add a location which is not a country or region. Value is parsed and compared to existing locations in HDX. If the location is already added, it is ignored.
Parameters
-
location : str — Location to add
-
exact : bool — True for exact matching or False to allow fuzzy matching. Defaults to True.
-
alterror : str | None — Alternative error message to builtin if location not found. Defaults to None.
-
locations : Sequence[str] | None — Valid locations list. Defaults to list downloaded from HDX.
Returns
-
bool — True if location added or False if location already present
Raises
-
HDXError
method Dataset.remove_location(location: str) → bool
Remove a location. If the location is not present, nothing is removed.
Parameters
-
location : str — Location to remove
Returns
-
bool — True if location removed or False if not
method Dataset.get_maintainer() → User
method Dataset.set_maintainer(maintainer: Union['User', dict, str]) → None
Set the dataset's maintainer.
Parameters
-
maintainer : Union['User', dict, str] — Either a user id or User metadata from a User object or dictionary.
Returns
-
None — None
Raises
-
HDXError
method Dataset.get_organization() → Organization
method Dataset.set_organization(organization: Union['Organization', dict, str]) → None
Set the dataset's organization.
Parameters
-
organization : Union['Organization', dict, str] — Either an Organization id or Organization metadata from an Organization object or dictionary.
Returns
-
None — None
Raises
-
HDXError
method Dataset.get_showcases() → list['Showcase']
Get any showcases the dataset is in
Returns
-
list['Showcase'] — List of showcases
method Dataset.add_showcase(showcase: Union['Showcase', dict, str], showcases_to_check: Sequence['Showcase'] = None) → bool
Add dataset to showcase
Parameters
-
showcase : Union['Showcase', dict, str] — Either a showcase id or showcase metadata from a Showcase object or dictionary
-
showcases_to_check : Sequence['Showcase'] — List of showcases against which to check existence of showcase. Defaults to showcases containing dataset.
Returns
-
bool — True if the showcase was added, False if already present
method Dataset.add_showcases(showcases: Sequence[Union['Showcase', dict, str]], showcases_to_check: Sequence['Showcase'] = None) → bool
Add dataset to multiple showcases
Parameters
-
showcases : Sequence[Union['Showcase', dict, str]] — A list of either showcase ids or showcase metadata from Showcase objects or dictionaries
-
showcases_to_check : Sequence['Showcase'] — List of showcases against which to check existence of showcase. Defaults to showcases containing dataset.
Returns
-
bool — True if all showcases added or False if any already present
method Dataset.remove_showcase(showcase: Union['Showcase', dict, str]) → None
Remove dataset from showcase
Parameters
-
showcase : Union['Showcase', dict, str] — Either a showcase id string or showcase metadata from a Showcase object or dictionary
Returns
-
None — None
method Dataset.is_requestable() → bool
Return whether the dataset is requestable or not
Returns
-
bool — Whether the dataset is requestable or not
method Dataset.set_requestable(requestable: bool = True) → None
Set the dataset to be of type requestable or not
Parameters
-
requestable : bool — Set whether dataset is requestable. Defaults to True.
Returns
-
None — None
method Dataset.get_fieldnames() → list[str]
Return list of fieldnames in your data. Only applicable to requestable datasets.
Returns
-
list[str] — List of field names
Raises
-
NotRequestableError
method Dataset.add_fieldname(fieldname: str) → bool
Add a fieldname to list of fieldnames in your data. Only applicable to requestable datasets.
Parameters
-
fieldname : str — Fieldname to add
Returns
-
bool — True if fieldname added or False if fieldname already present
Raises
-
NotRequestableError
method Dataset.add_fieldnames(fieldnames: Sequence[str]) → bool
Add a list of fieldnames to list of fieldnames in your data. Only applicable to requestable datasets.
Parameters
-
fieldnames : Sequence[str] — List of fieldnames to add
Returns
-
bool — True if all fieldnames added or False if any already present
Raises
-
NotRequestableError
method Dataset.remove_fieldname(fieldname: str) → bool
Remove a fieldname. Only applicable to requestable datasets.
Parameters
-
fieldname : str — Fieldname to remove
Returns
-
bool — True if fieldname removed or False if not
Raises
-
NotRequestableError
method Dataset.get_filetypes() → list[str]
Return list of filetypes in your data
Returns
-
list[str] — List of filetypes
method Dataset.add_filetype(filetype: str) → bool
Add a filetype to list of filetypes in your data. Only applicable to requestable datasets.
Parameters
-
filetype : str — Filetype to add
Returns
-
bool — True if filetype added or False if filetype already present
Raises
-
NotRequestableError
method Dataset.add_filetypes(filetypes: Sequence[str]) → bool
Add a list of filetypes to list of filetypes in your data. Only applicable to requestable datasets.
Parameters
-
filetypes : Sequence[str] — List of filetypes to add
Returns
-
bool — True if all filetypes added or False if any already present
Raises
-
NotRequestableError
method Dataset.remove_filetype(filetype: str) → bool
Remove a filetype
Parameters
-
filetype : str — Filetype to remove
Returns
-
bool — True if filetype removed or False if not
Raises
-
NotRequestableError
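Because the fieldname methods are documented as only applicable to requestable datasets, a caller would typically call set_requestable before describing the data. The following is a minimal sketch of that order of calls. The helper make_requestable and the StandInDataset class are hypothetical: the stand-in keeps values in plain lists so the sketch runs without access to HDX, and it does not reproduce the library's error handling.

```python
from typing import Sequence


def make_requestable(dataset, fieldnames: Sequence[str], filetypes: Sequence[str]) -> bool:
    """Mark the dataset as requestable, then describe its fields and file types."""
    # Switch to requestable first: the fieldname methods only apply to such datasets.
    dataset.set_requestable(True)
    all_fieldnames_new = dataset.add_fieldnames(fieldnames)
    all_filetypes_new = dataset.add_filetypes(filetypes)
    # Both calls are documented to return False if any value was already present.
    return all_fieldnames_new and all_filetypes_new


class StandInDataset:
    """Hypothetical stand-in, not the real Dataset; it keeps values in plain lists."""

    def __init__(self):
        self.requestable = False
        self.fieldnames = []
        self.filetypes = []

    def set_requestable(self, requestable=True):
        self.requestable = requestable

    @staticmethod
    def _add_all(target, values):
        all_added = True
        for value in values:
            if value in target:
                all_added = False
            else:
                target.append(value)
        return all_added

    def add_fieldnames(self, fieldnames):
        return self._add_all(self.fieldnames, fieldnames)

    def add_filetypes(self, filetypes):
        return self._add_all(self.filetypes, filetypes)


standin = StandInDataset()
first_call = make_requestable(standin, ["admin1", "population"], ["csv"])
second_call = make_requestable(standin, ["admin1"], ["xlsx"])
print(first_call, second_call)
```

The second call returns False in this sketch because "admin1" was already present, which mirrors the documented "False if any already present" behaviour.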
method Dataset.set_custom_viz(url: str) → None
Set custom visualization url for dataset
Parameters
-
url : str — Custom visualization url
Returns
-
None — None
method Dataset.get_custom_viz() → str | None
Get custom visualization url for dataset
Returns
-
str | None — Custom visualization url or None
method Dataset.preview_off() → None
Set dataset preview off
Returns
-
None — None
method Dataset.preview_resource() → None
Set dataset preview on for an unspecified resource
Returns
-
None — None
method Dataset.set_preview_resource(resource: Union['Resource', dict, str, int]) → Resource
Set the resource that will be used for displaying previews in dataset preview
Parameters
-
resource : Union['Resource', dict, str, int] — Either a resource id or name, resource metadata from a Resource object or dictionary, or the resource's position
Returns
-
Resource — Resource that is used for preview or None if no preview set
Raises
-
HDXError
method Dataset.create_default_views(create_datastore_views: bool = False) → None
Create default resource views for all resources in dataset
Parameters
-
create_datastore_views : bool — Whether to try to create resource views that point to the datastore
Returns
-
None — None
method Dataset.get_name_or_id(prefer_name: bool = True) → str | None
Get dataset name or id, e.g. for use in urls. If prefer_name is True, name is preferred over id if available; otherwise id is preferred over name if available.
Parameters
-
prefer_name : bool — Whether name is preferred over id. Defaults to True.
Returns
-
str | None — HDX dataset id or name or None if not available
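The preference rule described above can be illustrated with a plain dictionary. This is a sketch of the documented behaviour only, not the library's implementation; the function pick_name_or_id and the metadata values are hypothetical.

```python
from typing import Optional


def pick_name_or_id(metadata: dict, prefer_name: bool = True) -> Optional[str]:
    """Return the preferred identifier, falling back to the other one, else None."""
    preferred, fallback = ("name", "id") if prefer_name else ("id", "name")
    return metadata.get(preferred) or metadata.get(fallback) or None


metadata = {"name": "my-dataset", "id": "hypothetical-id-123"}  # hypothetical values
print(pick_name_or_id(metadata))
print(pick_name_or_id(metadata, prefer_name=False))
print(pick_name_or_id({"id": "hypothetical-id-123"}))
print(pick_name_or_id({}))
```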
method Dataset.get_hdx_url(prefer_name: bool = True) → str | None
Get the url of the dataset on HDX or None if the dataset name and id fields are missing. If prefer_name is True, name is preferred over id if available, otherwise id is preferred over name if available.
Parameters
-
prefer_name : bool — Whether name is preferred over id in url. Defaults to True.
Returns
-
str | None — Url of the dataset on HDX or None if the dataset is missing fields
method Dataset.get_api_url(prefer_name: bool = True) → str | None
Get the API url of the dataset on HDX
Parameters
-
prefer_name : bool — Whether name is preferred over id in url. Defaults to True.
Returns
-
str | None — API url of the dataset on HDX or None if the dataset is missing fields
method Dataset.generate_resource(folder: Path | str, filename: str, rows: Iterable[Sequence | Mapping], resourcedata: dict, headers: int | Sequence[str] | None = None, columns: Sequence[int] | Sequence[str] | None = None, format: str = 'csv', encoding: str | None = None, datecol: int | str | None = None, yearcol: int | str | None = None, date_function: Callable[[dict], dict | None] | None = None, no_empty: bool = True) → tuple[bool, dict]
Write rows to file and create resource, adding it to the dataset. The headers argument is either a row number (rows start counting at 1) or the actual headers defined as a list of strings. If not set, all rows will be treated as containing values. Specific columns to include can be specified (i.e. a subset of the headers).
The returned dictionary will contain the resource in the key resource, headers in the key headers and list of rows in the key rows.
The time period can optionally be set by supplying a column in which the date or year is to be looked up. Note that any timezone information is ignored and UTC assumed. Alternatively, a function can be supplied to handle any dates in a row. It should accept a row and should return None to ignore the row or a dictionary which can either be empty if there are no dates in the row or can be populated with keys startdate and/or enddate which are of type timezone-aware datetime. The lowest start date and highest end date are used to set the time period and are returned in the results dictionary in keys startdate and enddate.
Parameters
-
folder : Path | str — Folder to which to write file containing rows
-
filename : str — Filename of file to write rows
-
rows : Iterable[Sequence | Mapping] — List of rows in dict or list form
-
resourcedata : dict — Resource data
-
headers : int | Sequence[str] | None — All headers. Defaults to None.
-
columns : Sequence[int] | Sequence[str] | None — Columns to write. Defaults to all.
-
format : str — Format to write. Defaults to csv.
-
encoding : str | None — Encoding to use. Defaults to None (infer encoding).
-
datecol : int | str | None — Date column for setting time period. Defaults to None (don't set).
-
yearcol : int | str | None — Year column for setting dataset year range. Defaults to None (don't set).
-
date_function : Callable[[dict], dict | None] | None — Date function to call for each row. Defaults to None.
-
no_empty : bool — Don't generate resource if there are no data rows. Defaults to True.
Returns
-
tuple[bool, dict] — (True if resource added, dictionary of results)
Raises
-
HDXError
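The date_function contract described above (return None to ignore a row, an empty dictionary if the row has no dates, or a dictionary with timezone-aware startdate and/or enddate values) can be sketched as follows. The column names "date" and "status" and the sample rows are hypothetical. The final lines only mimic, locally, how the lowest start date and highest end date would be picked; in practice that step is performed by the library.

```python
from datetime import datetime, timezone
from typing import Optional


def date_function(row: dict) -> Optional[dict]:
    """Return None to skip the row, {} if it has no date, else start and end dates."""
    if row.get("status") == "draft":  # hypothetical column used to skip rows
        return None
    raw = row.get("date")  # hypothetical column holding YYYY-MM-DD strings
    if not raw:
        return {}
    day = datetime.strptime(raw, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return {"startdate": day, "enddate": day}


rows = [
    {"date": "2023-03-01", "status": "final", "value": "10"},
    {"date": "2023-01-05", "status": "final", "value": "7"},
    {"date": "2024-02-11", "status": "draft", "value": "3"},
    {"date": "", "status": "final", "value": "5"},
]

# Local illustration of the aggregation: ignored rows (None) do not contribute.
results = [date_function(row) for row in rows]
kept = [result for result in results if result is not None]
starts = [result["startdate"] for result in kept if "startdate" in result]
ends = [result["enddate"] for result in kept if "enddate" in result]
period = (min(starts), max(ends))
print(period)
```

A function of this shape would be passed as the date_function argument instead of supplying datecol or yearcol.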
method Dataset.download_generate_resource(downloader: BaseDownload, url: str, folder: Path | str, filename: str, resourcedata: dict, header_insertions: Sequence[tuple[int, str]] | None = None, row_function: Callable[[list[str], dict], dict] | None = None, columns: Sequence[int] | Sequence[str] | None = None, format: str = 'csv', encoding: str | None = None, datecol: int | str | None = None, yearcol: int | str | None = None, date_function: Callable[[dict], dict | None] | None = None, no_empty: bool = True, **kwargs: Any) → tuple[bool, dict]
Download url, write rows to csv and create resource, adding it to the dataset. The returned dictionary will contain the resource in the key resource, headers in the key headers and list of rows in the key rows.
Optionally, headers can be inserted at specific positions using the header_insertions argument. If supplied, it is a list of tuples of the form (position, header) to be inserted. Optionally, a function (row_function) can be called for each row. If supplied, it takes as arguments headers (prior to any insertions) and row (which will be in dict or list form depending upon the dict_rows argument) and outputs a modified row.
The time period can optionally be set by supplying a column in which the date or year is to be looked up. Note that any timezone information is ignored and UTC assumed. Alternatively, a function can be supplied to handle any dates in a row. It should accept a row and should return None to ignore the row or a dictionary which can either be empty if there are no dates in the row or can be populated with keys startdate and/or enddate which are of type timezone-aware datetime. The lowest start date and highest end date are used to set the time period and are returned in the results dictionary in keys startdate and enddate.
Parameters
-
downloader : BaseDownload — A Download or Retrieve object
-
url : str — URL to download
-
folder : Path | str — Folder to which to write file containing rows
-
filename : str — Filename of file to write rows
-
resourcedata : dict — Resource data
-
header_insertions : Sequence[tuple[int, str]] | None — List of (position, header) to insert. Defaults to None.
-
row_function : Callable[[list[str], dict], dict] | None — Function to call for each row. Defaults to None.
-
columns : Sequence[int] | Sequence[str] | None — Columns to write. Defaults to all.
-
format : str — Format to write. Defaults to csv.
-
encoding : str | None — Encoding to use. Defaults to None (infer encoding).
-
datecol : int | str | None — Date column for setting time period. Defaults to None (don't set).
-
yearcol : int | str | None — Year column for setting dataset year range. Defaults to None (don't set).
-
date_function : Callable[[dict], dict | None] | None — Date function to call for each row. Defaults to None.
-
no_empty : bool — Don't generate resource if there are no data rows. Defaults to True.
-
**kwargs : Any — Any additional args to pass to downloader.get_tabular_rows
Returns
-
tuple[bool, dict] — (True if resource added, dictionary of results)
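The interplay of header_insertions and row_function can be sketched locally as below. The lookup table ISO3_BY_COUNTRY, the sample rows and the helper preview_rows are hypothetical; preview_rows is a stand-in that mimics the described processing for dict rows and assumes insertions are applied in list order. The real work is done by the library and the downloader.

```python
# Hypothetical lookup table used only for this illustration.
ISO3_BY_COUNTRY = {"Kenya": "KEN", "Afghanistan": "AFG"}

# One header to insert at position 0, in the documented (position, header) form.
header_insertions = [(0, "iso3")]


def row_function(headers: list, row: dict) -> dict:
    """Fill the inserted column; headers are those from before any insertions."""
    row["iso3"] = ISO3_BY_COUNTRY.get(row["country"], "")
    return row


def preview_rows(headers, rows, header_insertions, row_function):
    """Hypothetical local stand-in mimicking the described processing."""
    new_headers = list(headers)
    # Assumption: insertions are applied one after another in list order.
    for position, header in header_insertions:
        new_headers.insert(position, header)
    # Copy each row so the input rows are left untouched.
    new_rows = [row_function(headers, dict(row)) for row in rows]
    return new_headers, new_rows


headers = ["country", "value"]
rows = [{"country": "Kenya", "value": "12"}, {"country": "Narnia", "value": "3"}]
new_headers, new_rows = preview_rows(headers, rows, header_insertions, row_function)
print(new_headers)
print(new_rows)
```

In a real call, header_insertions and row_function would be passed to download_generate_resource together with a downloader, url, folder, filename and resourcedata.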
method Dataset.add_hapi_error(error_message: str, resource_name: str | None = None, resource_id: str | None = None) → bool
Writes error messages that were uncovered while processing data for the HAPI database to a resource's metadata on HDX. If the resource already has an error message, it is only overwritten if the two messages are different.
Parameters
-
error_message : str — Error(s) uncovered
-
resource_name : str | None — Resource name. Defaults to None.
-
resource_id : str | None — Resource id. Defaults to None.
Returns
-
bool — True if a message was added, False if not
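A caller that collects several errors per resource might write them in one pass, as in the minimal sketch below. The helper report_hapi_errors, the "; " separator and the StandInDataset class are hypothetical. The stand-in assumes that an identical message is not rewritten and yields False, which is one reading of the overwrite rule described above; it exists only so the sketch runs without access to HDX.

```python
def report_hapi_errors(dataset, errors_by_resource: dict) -> list:
    """Write one combined message per resource; return names where one was added."""
    updated = []
    for resource_name, messages in errors_by_resource.items():
        if not messages:
            continue
        # Joining with "; " is a choice made for this sketch.
        combined = "; ".join(messages)
        if dataset.add_hapi_error(combined, resource_name=resource_name):
            updated.append(resource_name)
    return updated


class StandInDataset:
    """Hypothetical stand-in, not the real Dataset; it stores messages in a dict."""

    def __init__(self):
        self.errors = {}

    def add_hapi_error(self, error_message, resource_name=None, resource_id=None):
        # Assumed behaviour: overwrite only if the stored message differs.
        if self.errors.get(resource_name) == error_message:
            return False
        self.errors[resource_name] = error_message
        return True


standin = StandInDataset()
errors = {"admin1.csv": ["missing p-codes", "duplicate rows"], "admin2.csv": []}
first_run = report_hapi_errors(standin, errors)
second_run = report_hapi_errors(standin, errors)
print(first_run, second_run)
```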