BirdData is a Python wrapper for the Xeno-canto API 2.0. It lets users download bird data with a single command and supports parallel downloads.
Clone the repository:
git clone git@github.com:realzza/birdData.git
Set up environment:
pip install -r requirement.txt
A metadata file is a short description of one recording. Typically, metadata files contain information such as the recordist, recording time, country, location, latitude, longitude, altitude, and recording length. Below is an example of a metadata file.
{
"id": "426350",
"gen": "Abroscopus",
"sp": "superciliaris",
"ssp": "",
"en": "Yellow-bellied Warbler",
"rec": "Peter Boesman",
"cnt": "India",
"loc": "Eagle Nest, Sessni area and lower, Arrunachal Pradesh",
"lat": "27.0223",
"lng": "92.4139",
"alt": "",
"type": "song",
"url": "//xeno-canto.org/426350",
"file": "https://xeno-canto.org/426350/download"
}
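As a rough illustration of how such a record can be consumed, the sketch below parses the example above and normalizes a few fields. The `summarize` helper and the output key names are hypothetical and not part of BirdData; the sketch only assumes the field layout shown in the example.

```python
import json

# Sample record copied from the example above (trimmed to the fields used here).
record_json = """
{
  "id": "426350",
  "gen": "Abroscopus",
  "sp": "superciliaris",
  "en": "Yellow-bellied Warbler",
  "lat": "27.0223",
  "lng": "92.4139",
  "alt": "",
  "url": "//xeno-canto.org/426350",
  "file": "https://xeno-canto.org/426350/download"
}
"""


def summarize(record):
    """Hypothetical helper: normalize a few fields of one metadata record."""

    def to_float(value):
        # Empty strings (for example a missing altitude) become None.
        return float(value) if value else None

    url = record["url"]
    if url.startswith("//"):
        # The record stores a protocol-relative URL, so prepend a scheme.
        url = "https:" + url
    return {
        "scientific_name": f'{record["gen"]} {record["sp"]}',
        "latitude": to_float(record["lat"]),
        "longitude": to_float(record["lng"]),
        "altitude": to_float(record["alt"]),
        "page_url": url,
        "audio_url": record["file"],
    }


summary = summarize(json.loads(record_json))
print(summary["scientific_name"], summary["latitude"], summary["longitude"])
```

Note that all values arrive as strings, so numeric fields need explicit conversion and empty strings need a fallback.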
Use download-meta.py to download metadata files. Customize your query by setting one or more of the parameters below before requesting metadata from the Xeno-canto API.
optional arguments:
-h, --help show this help message and exit
--gen GEN genus
--ssp SSP subspecies
--cnt CNT country
--type TYPE type
--rmk RMK remark
--lat LAT latitude
--lon LON longtitude
--loc LOC location
--box BOX box:LAT_MIN,LON_MIN,LAT_MAX,LON_MAX
--area AREA Continent
--since SINCE e.g. since:2012-11-09
--year YEAR year
--month MONTH month
--output OUTPUT directory to output directory. default: `dataset/metadata/`
--attempts ATTEMPTS
A sample metadata download:
python download-meta.py --cnt China --loc Shanghai --since 2022-01-01 --output test/
Please refer to the Xeno-canto Search Tips for definitions of the parameters above.
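To make the link between these parameters and a Xeno-canto search concrete, here is a minimal sketch of how tag and value pairs could be combined into one API 2.0 request URL. The `build_query_url` helper is hypothetical, and this is not necessarily how the metadata script builds its request; the endpoint and the `query` and `page` parameters are given as I understand them from the API 2.0 documentation.

```python
from urllib.parse import urlencode

# Recordings endpoint of Xeno-canto API 2.0 (assumed from the public API docs).
API_URL = "https://xeno-canto.org/api/2/recordings"


def build_query_url(page=1, **tags):
    """Hypothetical helper: join tag:value pairs into one search query URL."""
    parts = []
    for tag, value in tags.items():
        if value is None:
            # Skip parameters the user did not set.
            continue
        value = str(value)
        if " " in value:
            # Multi-word values are quoted, e.g. cnt:"United States".
            value = f'"{value}"'
        parts.append(f"{tag}:{value}")
    # urlencode percent-encodes the colons and turns spaces into "+".
    return API_URL + "?" + urlencode({"query": " ".join(parts), "page": page})


url = build_query_url(cnt="China", loc="Shanghai", since="2022-01-01")
print(url)
```

This mirrors the sample command above: each command-line flag becomes one `tag:value` term in the search query.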
Download audio data for one bird species. Use the scientific name in lowercase, e.g. cettia cetti.
python download.py --name "cettia cetti"
Download audio data for every species listed in a file. Format requirement: one name per line (names separated by "\n").
python download.py --name name_file
General Usage:
usage: download.py [-h] --name NAME
download bird audios
optional arguments:
-h, --help show this help message and exit
--name NAME [1] name of one bird species; [2] file of bird species spaced
by '\n'
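The two accepted forms of `--name` could be resolved along the following lines. This is a sketch under assumptions, not the script's actual code: `load_species` is a hypothetical helper, and lowercasing the names and skipping blank lines are my own choices.

```python
import os
import tempfile


def load_species(name_arg):
    """Hypothetical helper: resolve a --name value into a list of species."""
    if os.path.isfile(name_arg):
        with open(name_arg, encoding="utf-8") as handle:
            # One species per line; skip blank lines and stray whitespace.
            return [line.strip().lower() for line in handle if line.strip()]
    # Not a file on disk, so treat the argument as a single species name.
    return [name_arg.strip().lower()]


# Usage: a single name first, then a temporary name file.
print(load_species("cettia cetti"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "name_file")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("cettia cetti\nabroscopus superciliaris\n\n")
    species = load_species(path)
print(species)
```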
Speed up downloading by using multiple processes.
python download-mult.py --name "cettia cetti" --process-ratio 0.6
Download multiple species listed in a file. Format requirement: one name per line (names separated by "\n").
python download-mult.py --name name_file --process-ratio 0.6
General Usage:
usage: download-mult.py [-h] --name NAME [--process-ratio PROCESS_RATIO]
download bird audios
optional arguments:
-h, --help show this help message and exit
--name NAME [1] name of one bird species; [2] file of bird species
spaced by '\n'
--process-ratio PROCESS_RATIO
float[0~1], define cpu utilities in downloading audios
[default: 0.8]
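One plausible reading of `--process-ratio` is that it scales the number of worker processes with the number of available CPU cores. The sketch below shows that interpretation; `worker_count` is a hypothetical helper, and the rounding and the lower bound of one worker are assumptions rather than the script's documented behaviour.

```python
import os


def worker_count(process_ratio, total_cpus=None):
    """Hypothetical helper: turn a CPU ratio into a number of worker processes."""
    if not 0 < process_ratio <= 1:
        raise ValueError("process ratio must be greater than 0 and at most 1")
    if total_cpus is None:
        # os.cpu_count() may return None, so fall back to a single core.
        total_cpus = os.cpu_count() or 1
    # Round down, but always keep at least one worker.
    return max(1, int(total_cpus * process_ratio))


# With 8 cores and a ratio of 0.6, this sketch would start 4 workers.
print(worker_count(0.6, total_cpus=8))
```

A lower ratio leaves more cores free for other work while the download runs.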
Multiprocess programs are hard to kill manually. To address this, download-mult.py automatically generates a kill.sh script once downloading has started. Kill the program with:
bash kill.sh
Failed downloads are recorded in bad_urls.txt so that you can retry them later if necessary.
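A retry pass over the failure record might look like the sketch below. It assumes that bad_urls.txt holds one URL per line, which is not confirmed by this README, and `retry_failed` and `fake_fetch` are hypothetical names. The download function is passed in as a parameter so the example runs without network access.

```python
import os
import tempfile


def retry_failed(bad_urls_path, fetch):
    """Retry every URL in the file and return the URLs that still fail."""
    with open(bad_urls_path, encoding="utf-8") as handle:
        # Assumed format: one URL per line. dict.fromkeys drops duplicates
        # while keeping the original order.
        urls = list(dict.fromkeys(line.strip() for line in handle if line.strip()))
    still_failing = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            # Broad catch keeps the sketch short; narrow it in real code.
            still_failing.append(url)
    return still_failing


def fake_fetch(url):
    """Stand-in for a real download function; fails for one URL on purpose."""
    if url.endswith("/2/download"):
        raise OSError("simulated failure")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bad_urls.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(
            "https://xeno-canto.org/1/download\n"
            "https://xeno-canto.org/2/download\n"
            "https://xeno-canto.org/1/download\n"
        )
    remaining = retry_failed(path, fake_fetch)
print(remaining)
```

In real use, `fetch` would be replaced by whatever function actually downloads a recording and raises an exception on failure.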
The bird data you download is in .mp3 format, which lightweight audio-loading libraries such as soundfile and audiofile may not be able to read (librosa is much slower than these two). Convert the .mp3 files into .wav with the alignDataset-mult.py script.
python alignDataset-mult.py --dataDir dataset/audio --outDir ./wavs --process 24
usage: alignDataset-mult.py [-h] [--dataDir DATADIR] [--outDir OUTDIR]
[--process PROCESS]
align smaplerate of dataset
optional arguments:
-h, --help show this help message and exit
--dataDir DATADIR path to the input dir
--outDir OUTDIR path to the output dir
--process PROCESS number of process running
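For readers curious what a single conversion step could involve, the sketch below builds an ffmpeg command that decodes one .mp3 file and writes a resampled .wav file. Using ffmpeg is my assumption, and alignDataset-mult.py may well work differently; `ffmpeg_command`, the 16 kHz sample rate, the mono downmix, and the example input path are all hypothetical.

```python
from pathlib import Path


def ffmpeg_command(mp3_path, out_dir, sample_rate=16000):
    """Hypothetical helper: build an ffmpeg call converting one .mp3 to .wav."""
    mp3_path = Path(mp3_path)
    # Keep the file name and swap the extension.
    wav_path = Path(out_dir) / (mp3_path.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite an existing output file
        "-i", str(mp3_path),      # input file
        "-ar", str(sample_rate),  # output sample rate in Hz (assumed value)
        "-ac", "1",               # downmix to mono (assumed choice)
        str(wav_path),
    ]


cmd = ffmpeg_command("dataset/audio/example.mp3", "wavs")
print(" ".join(cmd))
# To run the conversion (requires ffmpeg on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.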
To stop a running conversion:
bash kill_align.sh
Failed conversions are recorded in bad_aligns.txt
- [12.29] multiprocess download
- [1.1] Automated killing script for multiprocess program
- [1.1] Bad url backup for trace back
- define sample rate prior to download
Feel free to file an issue if you encounter any problems. Have fun!