Workshop: Preparing Dataset for Analysis¶
- LABDAcademy 2 (27–31.05.2024, Trondheim, Norway)
- Lecturers: Marian Paiva Marchiori, Josef Heidler
Want to try experimenting with a Jupyter notebook online? It's super easy with Binder! Look for the Binder button. Clicking it will launch an online environment where you can execute the code yourself.
Introduction¶
Building on Jasper's introduction to combining sensor data, this workshop empowers you to take the next step. We'll guide you through using our Python package to load, combine and analyze the data you collected from both your GPS and accelerometer devices.
The documentation is available at https://labda.josefheidler.cz/
1. Standardizing Files¶
One of the inherent challenges of collecting movement behavior data is the inconsistency in file formats across different sensors and vendors. These disparities can significantly impede your analysis workflow, necessitating time-consuming and potentially error-prone data manipulation.
This workshop shows you how to overcome this hurdle. We'll introduce you to device readers, also known as parsers. These tools act as a bridge, seamlessly handling a variety of file types including CSV, JSON, GPX, and even vendor-specific binary formats. By using device readers, you can efficiently transform your raw data into a standardized format, streamlining the analysis process and letting you focus on extracting valuable insights from your movement behavior data.
Standard Format: Imagine your data as a well-organized spreadsheet, where each column has a clear label and the information inside is consistent. This makes it a breeze to analyze!
Example Data: Need some data to play around with? We've got your back! We have some test files you can use located at "data/sens" and "data/traccar" (run the next cell to see a list of the files). Feel free to use your own though!
from pathlib import Path
files = [file.as_posix() for file in Path("data/sens").iterdir()] + [
    file.as_posix() for file in Path("data/traccar").iterdir()
]
files
['data/sens/export_759DE5.csv', 'data/traccar/759DE5.parquet']
1.1 Reading GPS Files (Traccar)¶
Traccar offers a free, open-source solution for researchers to gather GPS data from many participants. It achieves this through a robust client-server architecture, ensuring the security and reliability of your data collection process.
Downloading data from Traccar requires three things: your login credentials (username and password), the server's address (URL), and the subject ID of the participant whose data you want.
Additional information regarding LABDA's approach to parsing data from the Traccar application can be found here.
# This cell is only relevant for those using the Traccar server in the workshop.
# If you are using the Traccar server, please fill in your subject ID.
from datetime import datetime
from labda.parsers import Traccar
subject_id = "759DE5" # Change this to your ID
gps = Traccar.from_server(
    url="https://gps.josefheidler.cz",
    username="trondheim",
    password="trondheim",
    subject_id=subject_id,
    start=datetime(2024, 5, 27),
    end=datetime(2024, 5, 31),
    accuracy_limit=200,
    environment_limit=25,
)
2024-05-30 08:46:57 | INFO | labda.parsers.traccar.from_server | 759DE5 | Parsed 10212 records (SF: 15.0s, TZ: Europe/Oslo, CRS: EPSG:32632) from: https://gps.josefheidler.cz (Traccar).
In case of Traccar server connectivity issues, pre-processed files on the local disk can be used.
from labda import Subject
# Provide the path to the parquet file containing the GPS data. Check the data folder for the available files.
path = "data/traccar/759DE5.parquet"
gps = Subject.from_parquet(path)
After you've read (parsed) the Traccar file, you'll gain access to the Subject.
The Subject acts like a central hub for your data. It holds two important pieces of information:
Metadata: This provides details about the parsed data itself, like the subject's ID, the specific device (sensor) it came from, and the timezone and coordinate system used. Think of it as the data's "passport" that tells you its origin and format.
Dataframe: This is the data itself, organized like a spreadsheet. Each column has a clear name, and the rows contain the specific values you're interested in.
gps.metadata
Metadata(id='759DE5', sensor=[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None)], sampling_frequency=15.0, crs='EPSG:32632', timezone='Europe/Oslo')
We're smart enough to guess some things from your data! Got GPS coordinates? We can likely figure out your time zone and coordinate reference system (CRS) automatically. Sampling frequency (how often data is collected) is usually easy for us to detect too. But hey, you're always in control – feel free to provide this information yourself or even override our guesses.
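To give a feel for what "infer" can mean here, a sampling frequency can be estimated as the median gap between consecutive timestamps. This is only a sketch of the idea, not labda's actual inference algorithm:

```python
from datetime import datetime
from statistics import median

# Illustrative sketch only (not labda's actual code): estimate the
# sampling frequency as the median gap in seconds between consecutive
# records, which is robust to occasional dropouts.
timestamps = [
    datetime(2024, 5, 27, 13, 40, 30),
    datetime(2024, 5, 27, 13, 40, 45),
    datetime(2024, 5, 27, 13, 41, 15),  # one record was dropped here
    datetime(2024, 5, 27, 13, 41, 30),
]
gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
sampling_frequency = median(gaps)
print(sampling_frequency)  # 15.0
```

The median (rather than the mean) keeps a few missing records, like the 30-second gap above, from skewing the estimate.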
gps.df
motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | environment | |
---|---|---|---|---|---|---|---|---|
datetime | ||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 894158.625 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.681363 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.797084 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.967489 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.393498 | 56.900002 | 0.0 | outdoor |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:45:15 | True | 7034513.632188 | 569673.198829 | 14.125 | 5.337112 | 52.0 | 1.039869 | outdoor |
2024-05-30 08:45:45 | True | 7034513.461384 | 569673.032835 | 24.632999 | 0.237777 | 53.100002 | 0.136467 | outdoor |
2024-05-30 08:46:00 | True | 7034513.426222 | 569672.953734 | 25.813 | 0.086378 | 52.100002 | 0.057855 | indoor |
2024-05-30 08:46:15 | True | 7034516.26338 | 569675.792054 | 26.905001 | 4.006353 | 52.600002 | 3.288865 | indoor |
2024-05-30 08:46:30 | True | 7034513.369759 | 569672.920023 | 9.0 | 4.070058 | 52.600002 | 0.781312 | outdoor |
10212 rows × 8 columns
There absolutely should be a guide to everything in this dataframe, but with only 27 hours a day, you know how it goes! We're working on describing all those columns and possible values, just give us a little extra time.
Note: The raw data wasn't originally time-aligned to whole seconds, but our parser intelligently handled this and formatted the data correctly, making it much easier for you to work with. We're also actively developing support for hertz units and plan to offer optional alignment to whole seconds, milliseconds, or hertz in the future. This will give you even more control over how your data is presented.
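Conceptually, aligning a timestamp to whole seconds is just rounding. Here's a minimal, hypothetical helper (not part of the labda package) showing the idea:

```python
from datetime import datetime, timedelta

def align_to_seconds(ts: datetime) -> datetime:
    """Round a timestamp to the nearest whole second.
    Hypothetical helper for illustration; not a labda function."""
    if ts.microsecond >= 500_000:
        ts += timedelta(seconds=1)  # round up past the half-second mark
    return ts.replace(microsecond=0)

aligned = align_to_seconds(datetime(2024, 5, 27, 13, 40, 29, 731042))
print(aligned)  # 2024-05-27 13:40:30
```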
Function help: Need a function refresher? No problem! In Jupyter notebooks, simply type the function name with a question mark (?) for a quick summary. Want more details? The documentation website has a complete rundown of the function's purpose and the specific inputs it requires. Be warned, though, the documentation might still be a work in progress.
from labda.parsers import Traccar
Traccar.from_server?
Signature:
Traccar.from_server(
    url: str,
    username: str,
    password: str,
    subject_id: str,
    *,
    start: datetime.datetime | None = None,
    end: datetime.datetime | None = None,
    sampling_frequency: float | None = None,
    crs: str | None = 'infer',
    timezone: str | None = 'infer',
    sensor_id: str | None = None,
    vendor: labda.structure.subject.Vendor = <Vendor.TRACCAR: 'Traccar'>,
    model: str | None = None,
    serial_number: str | None = None,
    firmware_version: str | None = None,
    accuracy_limit: int | float | None = None,
    environment_limit: int | float | None = None,
) -> labda.structure.subject.Subject

Docstring:
The Traccar API allows you to download data from a specific server (URL) using your login credentials (username and password). To obtain data for a particular device, you must specify a subject ID (which corresponds to the device identifier). Optionally, you can define a date and time range to download data for a specific timeframe. If no date range is provided (from/to), all data for the chosen device will be downloaded.

The downloaded data will be parsed and validated according to the schema defined in the structure module.

Args:
    url (str): The URL of the Traccar server.
    username (str): The username to authenticate with the server.
    password (str): The password to authenticate with the server.
    subject_id (str): The ID of the subject to fetch data for (corresponds to the device identifier).
    start (datetime, optional): The start time of the data to fetch. If not provided, it will fetch all available data.
    end (datetime, optional): The end time of the data to fetch. If not provided, it will fetch all available data.
    sampling_frequency (float, optional): The sampling frequency of the data. If not provided, it will be inferred from the data.
    timezone (str, optional): The timezone of the data. If set to "infer" it will be inferred from the data. If None it will be set to the local timezone. Otherwise, it will be set to the provided timezone.
    crs (str, optional): The coordinate reference system (CRS) of the data. If set to "infer" it will be inferred from the data. If None it will be set to "EPSG:4326". Otherwise, it will be set to the provided CRS.
    sensor_id (str, optional): The ID of the sensor to fetch data for. If not provided, the unique device ID from the Traccar server will be used (if available).
    vendor (Vendor, optional): The vendor of the device. Defaults to "Traccar".
    model (str, optional): The model of the device. If not provided, it will be fetched from the Traccar server (device - extra, phone + model) if available.
    serial_number (str, optional): The serial number of the device.
    firmware_version (str, optional): The firmware version of the device.
    accuracy_limit (int | float, optional): The maximum allowed GPS accuracy. If not provided, all points will be included; otherwise only points with accuracy lower than the provided value will be included.
    environment_limit (int | float, optional): The GPS accuracy threshold used to decide whether a point is indoor or outdoor. If a point's accuracy is lower than the provided value, the point will be marked as outdoor; otherwise it will be marked as indoor. If not provided, environment will not be detected.

Returns:
    Subject: A Subject object containing the fetched and processed dataframe containing information: datetime, latitude, longitude, gps_accuracy, distance, elevation, and speed.

Raises:
    HTTPError: HTTP request to the Traccar server failed.
    ValueError: No device is found for the specified subject ID.
    ValueError: No records are found for the specified device.
    ValueError: Environment limit must be lower than accuracy limit.

Examples:
    Here's how to call the function with just the minimum required parameters.

    ```python
    from labda.parsers import Traccar

    subject = Traccar.from_server(
        url="http://gps.example.com",
        username="admin",
        password="pwd9000",
        subject_id="john_doe",
    )
    ```

File: ~/projects/labda/labda/parsers/traccar.py
Type: function
1.2 Read Accelerometer File (SENS)¶
SENS offers a convenient way to gather physical activity data from large groups. This integrated system uses wireless accelerometers that automatically send data to secure cloud storage, and its ease of use makes it ideal for healthcare and research projects. SENS stores activity data in its own structure, which you can download directly from their web app. Once you do, our reader (parser) can easily translate it into our standardized format, making analyzing your activity data a breeze.
Parsing a SENS CSV file only requires the file path. However, specifying the desired output timezone is recommended; otherwise, timestamps will be in UTC.
Check out how LABDA tackles parsing CSV data files generated by SENS.
from labda.parsers import Sens
path = "data/sens/export_759DE5.csv" # Use the path to your data file. Check the data folder for examples.
timezone = gps.metadata.timezone # While our example files come in various time zones, the ideal approach is to automatically extract the correct timezone from your GPS data.
acl = Sens.from_csv(
    path=path,
    timezone=timezone,
)
2024-05-30 08:49:35 | INFO | labda.parsers.sens.from_csv | export_759DE5 | Parsed 56527 records (SF: 5.0s, TZ: Europe/Oslo) from: export_759DE5.csv (Sens).
Note: See those informative messages printed in color after each function runs? Those are actually our logs, keeping track of everything that happens as your data is processed! Although logging information is currently undergoing testing and might not capture everything flawlessly, it's designed to become a comprehensive record of function usage. We strive for extensive logging to guarantee transparency within automated data pipelines. This will empower researchers to follow their data's journey and pinpoint any problems during pipeline execution. In the future, these logs will be savable in structured formats like JSON, providing a convenient way to store and analyze logged data.
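To give a flavour of what structured JSON logs could look like, here's a small illustration using Python's standard logging module. This is just a sketch of the concept, not labda's actual logging setup:

```python
import json
import logging

# Illustration only (not labda's logging setup): a formatter that emits
# each log record as a JSON object, so a pipeline's logs can later be
# stored and queried like any other structured data.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("labda.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Parsed 56527 records")
# emits: {"level": "INFO", "name": "labda.example", "message": "Parsed 56527 records"}
```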
acl.metadata
Metadata(id='export_759DE5', sensor=[Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)], sampling_frequency=5.0, crs=None, timezone='Europe/Oslo')
acl.df
wear | position | steps | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|
datetime | ||||||
2024-05-27 00:00:05 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:10 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:15 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:20 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:25 | True | sitting-lying | 0 | sedentary | 0.89 | resting |
... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:37:10 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:15 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:20 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:25 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:30 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
56527 rows × 6 columns
2. Merge Data from Both Sensors¶
Remember how we talked about the power of multiple sensors on Monday – more data, more insights, more fun! But with all that info, you might be wondering: is merging sensor data a total headache? Spoiler alert: absolutely not! It's actually quite achievable.
from labda import merge_subjects
subject = merge_subjects(gps, acl)
2024-05-30 08:49:44 | ERROR | labda.structure.merging._check_ids | 759DE5; export_759DE5 | IDs do not match (left: 759DE5, right: export_759DE5).
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 3
      1 from labda import merge_subjects
----> 3 subject = merge_subjects(gps, acl)

File ~/projects/labda/labda/structure/merging.py:145, in merge_subjects(left, right, how, **kwargs)
    136 logger.error(
    137     message,
    138     extra={
    (...)
    141     },
    142 )
    143 raise ValueError(message)
--> 145 id = _check_ids(left, right)
    146 sf = _check_sampling_frequency(left, right)
    147 tz = _check_timezones(left, right)

File ~/projects/labda/labda/structure/merging.py:48, in _check_ids(left, right)
     38 message = (
     39     f"IDs do not match (left: {left.metadata.id}, right: {right.metadata.id})."
     40 )
     41 logger.error(
     42     message,
     43     extra={
     (...)
     46     },
     47 )
---> 48 raise ValueError(message)
     50 return left.metadata.id

ValueError: IDs do not match (left: 759DE5, right: export_759DE5).
Yikes! Mismatched subject IDs! You might be wondering why this matters. Imagine a study with over 1000 participants wearing accelerometers and GPS trackers. When you want to combine their data by participant, matching IDs are crucial. Why? Because if IDs don't match, some data might get left behind. To avoid this, we need to check if IDs exist in both datasets. If not, we need to investigate and potentially adjust the data before merging.
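In a larger study you'd want to check ID overlap systematically before merging anything. Here's a minimal sketch using plain Python sets; the `export_` prefix rule mirrors the file names in our example data and is an assumption about how SENS names its exports:

```python
# Hypothetical sketch: reconciling subject IDs between two sensor
# exports before merging. The IDs and the "export_" naming convention
# are assumptions modelled on the workshop's example files.
gps_ids = {"759DE5", "8A01C2"}
acl_ids = {"export_759DE5", "export_8A01C2", "export_FF0000"}

def strip_prefix(sens_id: str) -> str:
    """Normalize a SENS export name to a bare subject ID."""
    return sens_id.removeprefix("export_")

normalized = {strip_prefix(i) for i in acl_ids}
both = gps_ids & normalized       # IDs safe to merge
acl_only = normalized - gps_ids   # accelerometer data without GPS
print(sorted(both), sorted(acl_only))  # ['759DE5', '8A01C2'] ['FF0000']
```

Any ID that shows up in only one of the sets should be investigated before you merge, so no participant's data silently gets left behind.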
If needed, you have the option to simply adjust the ID for one of the sensor datasets.
acl.metadata.id = "759DE5"
After fixing the IDs, just run the merging function again.
subject = merge_subjects(gps, acl)
2024-05-30 08:49:49 | ERROR | labda.structure.merging._check_sampling_frequency | 759DE5 | Sampling frequency mismatch (left: 15.0s, right: 5.0s).
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 subject = merge_subjects(gps, acl)

File ~/projects/labda/labda/structure/merging.py:146, in merge_subjects(left, right, how, **kwargs)
    143 raise ValueError(message)
    145 id = _check_ids(left, right)
--> 146 sf = _check_sampling_frequency(left, right)
    147 tz = _check_timezones(left, right)
    148 crs = _check_crs(left, right)

File ~/projects/labda/labda/structure/merging.py:63, in _check_sampling_frequency(left, right)
     55 message = f"Sampling frequency mismatch (left: {left.metadata.sampling_frequency}s, right: {right.metadata.sampling_frequency}s)."
     56 logger.error(
     57     message,
     58     extra={
     (...)
     61     },
     62 )
---> 63 raise ValueError(message)
     65 return left.metadata.sampling_frequency

ValueError: Sampling frequency mismatch (left: 15.0s, right: 5.0s).
Ugh, another hurdle! We've encountered mismatched sampling frequencies, meaning the data points weren't collected at the same intervals. In this case, the accelerometer data (collected every 5 seconds) needs to be downsampled to match the 15-second intervals of the GPS data.

To address this, we've got a handy function that automatically downsamples the higher-frequency data for you. It relies on a set of rules called a "mapper" that defines how to handle each data column; for example, the mapper might sum the number of steps but average the speed (from GPS). You have the flexibility to create your own custom mapper if needed.

It's important to remember that downsampling (aggregation) can introduce errors. The rule of thumb is that the higher the downsampling factor, the less accurate the data might become.
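The mapper idea can be sketched in a few lines of plain Python. This is an illustration of the concept (sum steps, average speed), not the package's actual implementation:

```python
# Minimal sketch of "mapper"-style downsampling (illustrative only,
# not labda's implementation): 5 s records are grouped into 15 s bins,
# steps are summed and speed is averaged, per the rules described above.
records = [  # (second offset, steps, speed) -- made-up example values
    (0, 2, 1.0), (5, 3, 2.0), (10, 1, 3.0),
    (15, 0, 0.0), (20, 4, 2.0), (25, 2, 1.0),
]

bins: dict[int, list[tuple[int, float]]] = {}
for t, steps, speed in records:
    bins.setdefault(t // 15 * 15, []).append((steps, speed))

downsampled = {
    t: (sum(s for s, _ in rows), sum(v for _, v in rows) / len(rows))
    for t, rows in bins.items()
}
print(downsampled)  # {0: (6, 2.0), 15: (6, 1.0)}
```

Note how summing preserves totals (steps) while averaging preserves rates (speed); picking the wrong aggregation for a column is exactly the kind of error a custom mapper lets you avoid.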
acl.downsample(15)
2024-05-30 08:49:56 | INFO | Subject.upsample | 759DE5 | Subject's data downsampled from 5.0s to 15s.
After downsampling, run the merging function once again.
subject = merge_subjects(gps, acl)
2024-05-30 08:49:59 | INFO | labda.structure.merging.merge_subjects | 759DE5 | Merged 10183 records (50, export_759DE5).
The merged data now includes metadata for both sensors, allowing you to easily identify the origin of the data.
subject.metadata.sensor
[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None), Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)]
subject.df
motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | environment | wear | position | steps | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 894158.625 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.681363 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.797084 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.967489 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.393498 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 7034502.980077 | 569671.299592 | 48.216 | 7.782863 | 53.100002 | 0.025865 | indoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:36:30 | True | 7034508.325039 | 569660.845259 | 24.764 | 11.717518 | 52.100002 | 0.051499 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:36:45 | True | 7034506.158289 | 569677.000718 | 44.108002 | 16.26247 | 52.600002 | 1.557108 | indoor | True | sitting-lying | 0 | sedentary | 0.01 | resting |
2024-05-30 08:37:00 | True | 7034501.701539 | 569667.868208 | 34.206001 | 10.140562 | 53.100002 | 1.767845 | indoor | True | sitting-lying | 0 | sedentary | 0.54 | resting |
2024-05-30 08:37:30 | True | 7034506.250361 | 569660.261525 | 24.555 | 8.845656 | 52.100002 | 0.045608 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
10183 rows × 14 columns
Note: The merge function is still under development, similar to other features. Currently, it offers limited options, focusing on an "inner join" approach. This means only data points present in both datasets are kept, and datasets cannot have identical columns to avoid duplicates. We understand the need for more advanced merging, particularly combining data from multiple identical sensors (like multiple accelerometers). This functionality is planned for future updates, allowing you to combine data from these sensors more effectively.
3. Expanding Your Data¶
Expand: Sensor data is powerful, but sometimes it needs a boost. Take GPS coordinates, for example. They hold a wealth of hidden information – distance traveled, speed, even acceleration. But extracting these insights can be a technical challenge. That's where our data expansion functions come in.
Recalculate: Sensor data is not immune to errors. Signal loss, hiccups, and freezes can introduce inaccuracies, especially in values like speed or distance. To address this, we offer the option to recalculate these values even if they already exist in a column. Just specify that you want to overwrite the existing data, and we'll ensure you have the most reliable data possible for your analysis.
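To make the recalculation concrete, here's the arithmetic for the first two GPS fixes from the table earlier, as a sketch rather than the package's implementation. With projected coordinates in metres (EPSG:32632), distance is planar Euclidean; the speed values in the dataframe are consistent with km/h, so we assume a 3.6 conversion factor:

```python
from math import hypot

# Illustrative sketch of the arithmetic behind add_distance/add_speed
# (not the package's implementation). Coordinates are projected metres
# (EPSG:32632), so distance is planar Euclidean; the dataframe's speed
# values appear to be km/h, hence the assumed 3.6 factor.
t0, y0, x0 = 0, 7034084.115065, 569292.445311   # first fix
t1, y1, x1 = 15, 7034083.578055, 569292.866324  # second fix, 15 s later

distance = hypot(x1 - x0, y1 - y0)          # metres
speed = distance / (t1 - t0) * 3.6          # km/h
print(round(distance, 3), round(speed, 3))  # 0.682 0.164
```

These match the `distance` and `speed` values in the second row of the merged dataframe above.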
subject.add_direction()
subject.add_timedelta()
subject.add_distance(overwrite=True)
subject.add_speed(overwrite=True)
subject.df
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_direction | 759DE5 | Direction column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_timedelta | 759DE5 | Timedelta column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_distance | 759DE5 | Distance column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_speed | 759DE5 | Speed column added.
motion | latitude | longitude | gnss_accuracy | elevation | environment | wear | position | steps | activity_intensity | activity_value | activity | direction | timedelta | distance | speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | <NA> | NaT | NaN | NaN |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 330.66119 | 0 days 00:00:15 | 0.682372 | 0.163769 |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 61.082443 | 0 days 00:00:30 | 0.798823 | 0.095859 |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 209.072067 | 0 days 00:00:15 | 0.968959 | 0.232550 |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 6.843455 | 0 days 00:00:30 | 1.394951 | 0.167394 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 7034502.980077 | 569671.299592 | 48.216 | 53.100002 | indoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 233.289403 | 0 days 00:00:30 | 7.800860 | 0.936103 |
2024-05-30 08:36:30 | True | 7034508.325039 | 569660.845259 | 24.764 | 52.100002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 202.855848 | 0 days 00:00:15 | 11.741452 | 2.817949 |
2024-05-30 08:36:45 | True | 7034506.158289 | 569677.000718 | 44.108002 | 52.600002 | indoor | True | sitting-lying | 0 | sedentary | 0.01 | resting | 73.518585 | 0 days 00:00:15 | 16.300113 | 3.912027 |
2024-05-30 08:37:00 | True | 7034501.701539 | 569667.868208 | 34.206001 | 53.100002 | indoor | True | sitting-lying | 0 | sedentary | 0.54 | resting | 329.253074 | 0 days 00:00:15 | 10.161956 | 2.438870 |
2024-05-30 08:37:30 | True | 7034506.250361 | 569660.261525 | 24.555 | 52.100002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 198.332748 | 0 days 00:00:30 | 8.863036 | 1.063564 |
10183 rows × 16 columns
Note: The future is brimming with possibilities! We're overflowing with ideas for even more "amazing things" to integrate. Imagine effortlessly retrieving elevation data or unlocking weather information for any specific location and time – that's the power we're building together. We prioritize features that matter most to you, so tell us what you need!
4. Export and Import Files¶
We've covered a lot: reading various data formats, downsampling, merging, and even extracting new information. Now, to avoid redoing all that work, let's save your masterpiece!
Parquet is your friend here. This popular data format keeps files compact, loads them quickly, and lets you embed valuable metadata. Plus, Parquet integrates seamlessly with other languages such as R, Rust, and Go, as well as with data warehouses, making it a future-proof choice for your data ecosystem. Save your work with confidence, knowing it'll be readily accessible whenever and wherever you need it.
path = "data/subject.parquet"
subject.to_parquet(path, overwrite=True)
2024-05-30 08:50:16 | INFO | labda.structure.subject.to_parquet | 759DE5 | Subject exported: data/subject.parquet
You can easily load subject data back into the environment. This process is fast and efficient, and all loaded data is validated to ensure it meets the LABDA standard format.
from labda import Subject
path = "data/subject.parquet"
imported_subject = Subject.from_parquet(path)
2024-05-30 08:50:19 | INFO | labda.structure.subject.from_parquet | 759DE5 | Subject imported: data/subject.parquet
imported_subject.metadata
Metadata(id='759DE5', sensor=[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None), Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)], sampling_frequency=15.0, crs='EPSG:32632', timezone='Europe/Oslo')
imported_subject.df
wear | timedelta | position | steps | motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | direction | environment | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||||
2024-05-27 13:40:30 | True | NaT | sitting-lying | 0 | False | 7034084.115065 | 569292.445311 | 11.484 | <NA> | 56.900002 | <NA> | <NA> | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:40:45 | True | 0 days 00:00:15 | sitting-lying | 0 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.682372 | 56.900002 | 0.163769 | 330.661194 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:41:15 | True | 0 days 00:00:30 | sitting-lying | 0 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.798823 | 56.900002 | 0.095859 | 61.082443 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:41:30 | True | 0 days 00:00:15 | sitting-lying | 0 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.968959 | 56.900002 | 0.23255 | 209.072067 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:42:00 | True | 0 days 00:00:30 | sitting-lying | 0 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.394951 | 56.900002 | 0.167394 | 6.843455 | outdoor | sedentary | 0.0 | resting |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 0 days 00:00:30 | sitting-lying | 0 | True | 7034502.980077 | 569671.299592 | 48.216 | 7.80086 | 53.100002 | 0.936103 | 233.289398 | indoor | sedentary | 0.0 | resting |
2024-05-30 08:36:30 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034508.325039 | 569660.845259 | 24.764 | 11.741452 | 52.100002 | 2.817949 | 202.85585 | outdoor | sedentary | 0.0 | resting |
2024-05-30 08:36:45 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034506.158289 | 569677.000718 | 44.108002 | 16.300112 | 52.600002 | 3.912027 | 73.518585 | indoor | sedentary | 0.01 | resting |
2024-05-30 08:37:00 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034501.701539 | 569667.868208 | 34.206001 | 10.161957 | 53.100002 | 2.438869 | 329.253082 | indoor | sedentary | 0.54 | resting |
2024-05-30 08:37:30 | True | 0 days 00:00:30 | sitting-lying | 0 | True | 7034506.250361 | 569660.261525 | 24.555 | 8.863036 | 52.100002 | 1.063564 | 198.332748 | outdoor | sedentary | 0.0 | resting |
10183 rows × 16 columns
5. Processing¶
With the datasets successfully merged and expanded, we can now proceed with the analysis phase to extract valuable insights.
5.1 Check Daily Wear Time¶
Before we take the next step and analyze your data, let's make sure it's ready! It's important to check whether you wore the sensors consistently throughout the academy program; complete sensor data gives us the most accurate picture of your progress. If there are any missing days, we can decide together how to handle them: we might need to exclude you from the analysis, or we can focus on the days with sensor data.
imported_subject.wear_time
wear_time | day | |
---|---|---|
datetime | ||
2024-05-27 | 0 days 06:18:15 | Monday |
2024-05-28 | 0 days 14:55:45 | Tuesday |
2024-05-29 | 0 days 14:39:00 | Wednesday |
2024-05-30 | 0 days 06:32:45 | Thursday |
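A common next step is applying an inclusion rule, e.g. keeping only days with at least some minimum wear time. Here's a sketch using the wear-time table above; the 10-hour threshold is an example value often seen in activity research, not a LABDA default:

```python
from datetime import timedelta

# Hypothetical sketch: applying an inclusion rule to the wear-time
# table above. The 10-hour threshold is an example value, not a
# LABDA default; pick a threshold appropriate for your study.
wear_time = {
    "2024-05-27": timedelta(hours=6, minutes=18, seconds=15),
    "2024-05-28": timedelta(hours=14, minutes=55, seconds=45),
    "2024-05-29": timedelta(hours=14, minutes=39),
    "2024-05-30": timedelta(hours=6, minutes=32, seconds=45),
}

threshold = timedelta(hours=10)
valid_days = [day for day, worn in wear_time.items() if worn >= threshold]
print(valid_days)  # ['2024-05-28', '2024-05-29']
```

Here Monday and Thursday fall short, which makes sense: the devices were handed out Monday afternoon and collected Thursday morning, so those are partial days rather than non-compliance.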
5.2 Detect Trips and Transportation Mode¶
Alright, let's take it one step at a time. First things first, let's visualize the GPS data on a map.
imported_subject.plot("gps")