Using Stream2segment in your Python code
This notebook assumes you have already run the download subroutine (s2s download -c <config_file.yaml> on the terminal) and that you want to run custom processing on the downloaded data
- Introduction
- Writing a simple processing function
- Iterating over selected segments to execute custom code
Stream2segment provides two main functions for working with downloaded data: process and imap. Both run custom code (a so-called processing function, pyfunc) on a selection of downloaded waveform segments: imap yields the output of pyfunc for each selected segment, whereas process writes the output of pyfunc for each selected segment as a row of a chosen tabular file (HDF or CSV)
from stream2segment.process import process, imap
# you can type help(process) or help(imap) for documentation
By running the command s2s init on the terminal, a fully documented Python module (paramtable.py) with a processing function called main is made available. The user can edit it and use it immediately in two ways:
- Produce the desired tabular output with the s2s process command (s2s process -p paramtable.py ...)
- Run customized code by invoking the module as a script (python paramtable.py), usually calling imap or process inside the usual if __name__ == "__main__": boilerplate code
In general, and especially for big data processing, this is the recommended way to process segments, as it avoids the unnecessary overhead of a notebook: imap and process, if run within a notebook, are not guaranteed to fully leverage multiple processors on a given machine, if required (this depends on how the notebook handles multiprocessing, which is outside our control), and will not be able to show a progress bar with the percentage of the task done and the estimated remaining time
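The module layout suggested above can be sketched as follows. This is a hypothetical skeleton, not the actual content of paramtable.py: the function body and the commented-out process call are placeholders for your own code.

```python
# Sketch of a paramtable.py-style module (hypothetical): a processing
# function plus the usual __main__ boilerplate. In a real module you would
# import stream2segment.process and call imap or process under the guard.

def main(segment, config):
    """Example processing function: return the segment id (placeholder logic)."""
    return segment.id

if __name__ == "__main__":
    # In a real script you would do something like (assumption, not executed):
    # from stream2segment.process import process
    # process(main, dburl, segments_selection, outfile="output.csv")
    print("run via: python paramtable.py")
```

Keeping the processing logic in a plain module like this lets the same function be used both by s2s process and by your own scripts.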
In most cases, all that is needed to access the database where the data has been downloaded is a simple URL string. IMPORTANT: be careful not to save, commit or distribute code with the database password embedded in the URL. For simplicity, the database URL is usually extracted from the download configuration file:
from stream2segment.process import yaml_load
# uncomment the line below using an existing file on your OS:
# dburl = yaml_load('download_file.yaml')['dburl']
Alternatively, the URL can be written here explicitly, following the RFC-1738 syntax. In this example we will use an example database (2 segments) provided in the same directory as this notebook and available if you run the s2s init command. If you have problems loading the database, check the current working directory and make sure the database is in it:
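For reference, a minimal sketch of building such URLs by hand (the non-SQLite credentials and host below are placeholders, not working examples):

```python
import os

# SQLite URLs are 'sqlite:///' followed by an absolute file path:
sqlite_url = 'sqlite:///' + os.path.join(os.getcwd(), 'jupyter.example.db')

# Other backends follow the general RFC-1738 form
# dialect://user:password@host:port/dbname (placeholder values):
postgres_url = 'postgresql://user:password@localhost:5432/mydb'
```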
# The database URL should be the parameter 'dburl' of the download configuration file.
# Here we use an example database (2 segments) provided in the same directory of this notebook:
import os
dburl = 'sqlite:///' + os.path.join(os.getcwd(), 'jupyter.example.db')
The selection of suitable segments is performed by creating a dict mapping one or more Segment attributes to a selection expression for that attribute (for details on the segment object and its attributes, see the segment object)
from stream2segment.process import Segment, get_segments
from sqlalchemy.orm import load_only
# create the selection dict:
segments_selection = {
'has_data': 'true',
'maxgap_numsamples': '[-0.5, 0.5]',
'event_distance_deg': '[20, 90]'
# other optional attributes (see cheatsheet below for details):
# missing_data_sec: '<120'
# missing_data_ratio: '<0.5'
# id: '<300'
# event.time: "(2014-01-01T00:00:00, 2014-12-31T23:59:59)"
# event.latitude: "[24, 70]"
# event.longitude: "[-11, 24]"
}
Each segment's waveform data is stored as a byte sequence in the segment.data attribute. However, you seldom need to access this attribute directly: Stream2segment defines shortcut methods to work with the corresponding ObsPy objects.
For instance, let's access the ObsPy Stream representing the waveform data of a segment object inside a processing function (for details, see the segment object):
def my_processing_function(segment, config):
    # Get the ObsPy Stream object:
    stream = segment.stream()
    # A Stream is a collection of Traces: take the first one (copying it, see caveat below):
    trace = stream[0].copy()
    # Now let's remove the instrumental response of the segment stream.
    # Get the ObsPy Inventory object:
    inventory = segment.inventory()
    # Remove the response (this modifies the stream in place, see caveat below):
    stream_remresp = stream.remove_response(inventory)
    trace_remresp = stream_remresp[0]
    # print('Trace data (response removed): ' + str(trace_remresp.data))
    return segment, trace, trace_remresp
Caveat: the trace data has now been permanently modified. This is not due to Stream2segment but to a specific design choice of ObsPy: remove_response operates in place. In other words, segment.stream() from now on returns stream_remresp (the stream with the response removed!):
Like remove_response, several other ObsPy Stream and Trace methods permanently modify the underlying data (please refer to the ObsPy documentation before applying them). In all of these cases, to recover the original trace, there are two strategies:
1] (recommended) Preserve segment.stream() by calling remove_response on a copy of the stream, or by copying the original trace first (as in the example above)
2] Reload the segment stream from the database with segment.stream(reload=True)
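The caveat above is standard Python behavior rather than something specific to ObsPy: a method that works in place changes the object seen by every existing reference. A minimal stdlib illustration (lists stand in for trace data; demean_inplace is a made-up helper, not an ObsPy function):

```python
def demean_inplace(data):
    """Subtract the mean from data, modifying it in place (like many ObsPy methods)."""
    mean = sum(data) / len(data)
    for i, value in enumerate(data):
        data[i] = value - mean
    return data  # returns the same (modified) object, not a new one

original = [1.0, 2.0, 3.0]
view = original           # another reference to the SAME list, not a copy
safe = list(original)     # an actual copy, analogous to stream[0].copy()

demean_inplace(original)
# `view` now sees the modified data; only `safe` preserves the original values
```

This is why the processing function above copies the first trace before removing the response.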
With the database URL, the segments selection and our processing function, we can now illustrate a simple usage of, e.g., imap. For this purpose, we simply print the segment event magnitude and the segment trace with and without the response removed, to show in practice what we just discussed:
# and now run `imap`. The function yields the output of our processing function
# (segment, trace, trace_remresp) together with the segment id
for (segment, trace, trace_remresp), segment_id in imap(my_processing_function, dburl, segments_selection):
    print()
    print('Segment Id: %d (event magnitude: %.1f)' % (segment_id, segment.event.magnitude))
    print('Segment trace (first three points):')
    print(' - Counts unit (no response removed): %s' % trace.data[:3])
    print(' - Physical units (response removed): %s' % trace_remresp.data[:3])
    print(' - As returned by `segment.stream()[0]`: %s' % segment.stream()[0].data[:3])
Segment Id: 2 (event magnitude: 8.1)
Segment trace (first three points):
- Counts unit (no response removed): [-314 -280 -251]
- Physical units (response removed): [ -1.42699395e-06 -1.43603990e-06 -1.42210201e-06]
- As returned by `segment.stream()[0]`: [ -1.42699395e-06 -1.43603990e-06 -1.42210201e-06]
Segment Id: 1 (event magnitude: 8.1)
Segment trace (first three points):
- Counts unit (no response removed): [196 211 94]
- Physical units (response removed): [ 5.82973528e-07 5.26137453e-07 5.78937133e-07]
- As returned by `segment.stream()[0]`: [ 5.82973528e-07 5.26137453e-07 5.78937133e-07]
If you need even more customization, a lower-level approach consists of simply working iteratively on the selected segments via get_segments. This is basically what imap and process do under the hood with a given segments selection:
from stream2segment.process import get_segments
segment = None
for seg in get_segments(dburl, segments_selection):
    # do your work here... Let's just store the first segment and exit the loop:
    segment = seg
    break
get_segments opens a database session, yields the selected segments and closes the session afterwards. A database session is an object that establishes all conversations with the database and represents a "holding zone" for all the objects which you have loaded or associated with it during its lifespan.
Closing a session when you have finished your work is recommended, as it releases memory on your computer and (if the database is remote) on the server, avoiding potential problems. Note that after a session is closed, all segment objects are detached from the database: we can no longer access all of their properties, but only those previously loaded. E.g., accessing a segment's related objects (such as the event object) outside the for loop raises an error:
try:
    seg.event
except Exception as exc:
    print('ERROR: ' + str(exc))

ERROR: Parent instance <Segment at 0x15aa1c0a0> is not bound to a Session; lazy load operation of attribute 'event' cannot proceed (Background on this error at: http://sqlalche.me/e/13/bhk3)
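The error comes from lazy loading: related objects are fetched from the database only on first access, which requires an open session. A toy sketch of the mechanism (an illustration only, not SQLAlchemy's actual implementation):

```python
class ToySession:
    """Stands in for a database session with a closed/open state."""
    def __init__(self):
        self.closed = False

class ToySegment:
    """Fetches its related event lazily; fails if the session is closed."""
    def __init__(self, session):
        self._session = session
        self._event = None  # not loaded yet

    @property
    def event(self):
        if self._event is None:
            if self._session.closed:
                raise RuntimeError('instance is not bound to a Session; '
                                   'lazy load of attribute "event" cannot proceed')
            self._event = 'Event loaded from db'  # placeholder for a real query
        return self._event

session = ToySession()
seg = ToySegment(session)
loaded = seg.event         # works: session open, attribute fetched and cached
session.closed = True
still_ok = seg.event       # still works: it was loaded before the close

seg2 = ToySegment(session)
try:
    seg2.event             # fails: session closed, attribute never loaded
    error = None
except RuntimeError as exc:
    error = str(exc)
```

Note how attributes loaded before the close remain accessible, exactly as described above for detached segment objects.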
In very specific cases where you want to keep the segments and all related objects accessible (i.e., attached to a session) also outside a get_segments for-loop, you can call get_segments with a session object instead of a database URL. Just remember to close the session manually at the end of your processing routine (see the end of this notebook):
from stream2segment.process import get_session, get_segments
session = get_session(dburl)
for seg in get_segments(session, segments_selection):
    # Let's just store again the first segment and exit the loop:
    segment = seg
    break
With a session open, we can safely inspect the segment object. We have already seen all the segment's selectable attributes, some methods (stream, inventory) and related objects such as segment.event, which are lazily loaded for performance reasons. These related Python objects are an extremely handy feature that would otherwise be complex to achieve without a database, with waveforms and metadata stored as files on your computer. Let's inspect some related objects (they are loaded into memory only upon access, and therefore we need the db session to be open):
Event object:
evt = seg.event
print(str(evt))

Event
attributes (16 of 16 loaded):
event_id: 20170908_0 (str, 16 characters, showing first 10 only)
time: 2017-09-08 04:49:21.200000 (datetime)
latitude: 15.02 (float)
longitude: -93.81 (float)
depth_km: 72.0 (float)
author: EMSC (str)
catalog: EMSC-RTS (str)
contributor: EMSC (str)
contributor_id: 616600 (str)
mag_type: mw (str)
magnitude: 8.1 (float)
mag_author: EMSC (str)
event_location_name: OFFSHORE C (str, 24 characters, showing first 10 only)
event_type: None (NoneType)
webservice_id: 1 (int)
id: 1 (int)
related_objects (0 of 1 loaded):
segments
Note that the event has the back-reference segments, which is a list of Segment objects: by design, one segment is always related to one event, whereas one event generates many recordings at different stations and is thus related to many segments (be aware of potential memory problems when accessing huge lists of related objects; for details, see the segment object).
The same kind of "segments" relation holds for both the Station and Channel objects (see below for details).
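The one-to-many relation described above can be sketched with plain dataclasses (a toy model, not Stream2segment's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class ToyEvent:
    id: int
    segments: list = field(default_factory=list)  # one event -> many segments

@dataclass
class ToySegment:
    id: int
    event: ToyEvent = None  # one segment -> exactly one event

evt = ToyEvent(id=1)
for seg_id in (1, 2, 3):  # the same event recorded at three stations
    seg = ToySegment(id=seg_id, event=evt)
    evt.segments.append(seg)
```

The same shape applies to Station.segments and Channel.segments, typically with much longer lists.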
Station object:
sta = seg.station
print(str(sta))

Station
attributes (11 of 11 loaded):
network: GE (str)
station: RUE (str)
latitude: 52.4759 (float)
longitude: 13.78 (float)
elevation: 40.0 (float)
site_name: None (NoneType)
start_time: 2012-03-21 10:00:00 (datetime)
end_time: None (NoneType)
inventory_xml: b'\x1f\x8b\x08\x00\xa4\x99\x1b\\\x02\xff' (bytes, 44710 elements, showing first 10 only)
datacenter_id: 1 (int)
id: 2 (int)
related_objects (0 of 3 loaded):
datacenter
channels
segments
Other accessible attributes not shown above are (for a complete list see below):
print(sta.netsta_code)

GE.RUE
Channel object:
cha = seg.channel
print(str(cha))

Channel
attributes (12 of 12 loaded):
location: (str)
channel: BHZ (str)
depth: 3.0 (float)
azimuth: 0.0 (float)
dip: -90.0 (float)
sensor_description: GFZ:GE1993 (str, 25 characters, showing first 10 only)
scale: 588000000.0 (float)
scale_freq: 0.02 (float)
scale_units: M/S (str)
sample_rate: 20.0 (float)
station_id: 2 (int)
id: 2 (int)
related_objects (0 of 2 loaded):
station
segments
Other accessible attributes not shown above are (for a complete list see below):
print(cha.band_code)
print(cha.instrument_code)
print(cha.orientation_code)

B
H
Z
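These one-character attributes simply expose the positions of the FDSN/SEED channel naming convention, where a code such as 'BHZ' is band + instrument + orientation. A minimal sketch of that convention (split_channel_code is a made-up helper, not part of Stream2segment):

```python
def split_channel_code(channel):
    """Split a 3-character SEED channel code into (band, instrument, orientation)."""
    band, instrument, orientation = channel[0], channel[1], channel[2]
    return band, instrument, orientation

# For 'BHZ': 'B' = broad band, 'H' = high gain seismometer, 'Z' = vertical component
band, instrument, orientation = split_channel_code('BHZ')
```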
Caveat: when working with lists of objects such as query objects, because Stream2segment is designed for massive downloads, it is better to load only each object's id, deferring the loading of all other attributes upon access: this is what .options(load_only('id')) does in the examples below (note that "id" is an attribute common to all object types: Segment, Event, Station, and so on).
We suggest using this approach for loading lists of related objects, e.g.:
evt = seg.event
# load event related segments (*risk of memory overflow: low):
segments = evt.segments.options(load_only('id')).all()
cha = seg.channel
# load channel related segments (*risk of memory overflow: medium):
segments = cha.segments.options(load_only('id')).all()
sta = seg.station
# load station related segments (*risk of memory overflow: high):
segments = sta.segments.options(load_only('id')).all()
dct = seg.datacenter
# load data center (e.g. IRIS) related segments (*risk of memory overflow: very high):
segments = dct.segments.options(load_only('id')).all()
* The risk levels reported are heuristic estimates and should be considered reliable only relative to each other (an event has almost certainly fewer related segments than a channel, which has almost certainly fewer related segments than a station, and so on)
In any case, for really memory-consuming or slow tasks, consider moving the notebook code into a custom Python module and using the command s2s process, which is specifically designed to better manage memory and performance
# finally close session
from stream2segment.process import close_session
close_session(session)

True
This program is released under the GNU GENERAL PUBLIC LICENSE Version 3
- Research article: Riccardo Zaccarelli, Dino Bindi, Angelo Strollo, Javier Quinteros and Fabrice Cotton. Stream2segment: An Open‐Source Tool for Downloading, Processing, and Visualizing Massive Event‐Based Seismic Waveform Datasets. Seismological Research Letters (2019). https://doi.org/10.1785/0220180314
- Software: Zaccarelli, Riccardo (2018): Stream2segment: a tool to download, process and visualize event-based seismic waveform data. V. 2.7.3. GFZ Data Services. http://doi.org/10.5880/GFZ.2.4.2019.002