Workshop: Preparing Dataset for Analysis¶
- LABDAcademy 2 (27–31.05.2024, Trondheim, Norway)
- Lecturers: Marian Paiva Marchiori, Josef Heidler
Want to try experimenting with a Jupyter notebook online? It's super easy with Binder! Look for the Binder button. Clicking it will launch an online environment where you can execute the code yourself.
Introduction¶
Building on Jasper's introduction to combining sensor data, this workshop empowers you to take the next step. We'll guide you through using our Python package to load, combine and analyze the data you collected from both your GPS and accelerometer devices.
The documentation is available at https://labda.josefheidler.cz/
1. Standardizing Files¶
One of the inherent challenges of collecting movement behavior data is the inconsistency in file formats across different sensors and vendors. These disparities can significantly impede your analysis workflow, necessitating time-consuming and potentially error-prone data manipulation.
This workshop shows you how to overcome this hurdle. We'll introduce you to device readers, also known as parsers. These tools act as a bridge, seamlessly handling a variety of file types including CSV, JSON, GPX, and even vendor-specific binary formats. By using device readers, you can efficiently transform your raw data into a standardized format, streamlining the analysis process and letting you focus on extracting valuable insights from your movement behavior data.
Standard Format: Imagine your data as a well-organized spreadsheet, where each column has a clear label and the information inside is consistent. This makes it a breeze to analyze!
Example Data: Need some data to play around with? We've got your back! We have some test files you can use located at "data/sens" and "data/traccar" (run the next cell to see a list of the files). Feel free to use your own though!
from pathlib import Path
files = [file.as_posix() for file in Path("data/sens").iterdir()] + [
    file.as_posix() for file in Path("data/traccar").iterdir()
]
files
['data/sens/export_759DE5.csv', 'data/traccar/759DE5.parquet']
1.1 Reading GPS Files (Traccar)¶
Traccar offers a free, open-source solution for researchers to gather GPS data from many participants. It achieves this through a robust client-server architecture, ensuring the security and reliability of your data collection process.
Downloading data from Traccar requires three things: your login credentials (username and password), the server's address (URL), and the subject ID of the participant whose data you want.
Additional information regarding LABDA's approach to parsing data from the Traccar application can be found here.
# This cell is only relevant for those using the Traccar server in the workshop.
# If you are using the Traccar server, please fill in your subject ID.
from datetime import datetime
from labda.parsers import Traccar
subject_id = "759DE5" # Change this to your ID
gps = Traccar.from_server(
    url="https://gps.josefheidler.cz",
    username="trondheim",
    password="trondheim",
    subject_id=subject_id,
    start=datetime(2024, 5, 27),
    end=datetime(2024, 5, 31),
    accuracy_limit=200,
    environment_limit=25,
)
2024-05-30 08:46:57 | INFO | labda.parsers.traccar.from_server | 759DE5 | Parsed 10212 records (SF: 15.0s, TZ: Europe/Oslo, CRS: EPSG:32632) from: https://gps.josefheidler.cz (Traccar).
In case of Traccar server connectivity issues, pre-processed files on the local disk can be used.
from labda import Subject
# Provide the path to the parquet file containing the GPS data. Check the data folder for the available files.
path = "data/traccar/759DE5.parquet"
gps = Subject.from_parquet(path)
After you've read (parsed) the Traccar file, you'll gain access to the Subject.
The Subject acts like a central hub for your data. It holds two important pieces of information:
Metadata: This provides details about the parsed data itself, like the subject's ID, the specific device (sensor) it came from, and the timezone and coordinate system used. Think of it as the data's "passport" that tells you its origin and format.
Dataframe: This is the data itself, organized like a spreadsheet. Each column has a clear name, and the rows contain the specific values you're interested in.
gps.metadata
Metadata(id='759DE5', sensor=[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None)], sampling_frequency=15.0, crs='EPSG:32632', timezone='Europe/Oslo')
We're smart enough to guess some things from your data! Got GPS coordinates? We can likely figure out your time zone and coordinate reference system (CRS) automatically. Sampling frequency (how often data is collected) is usually easy for us to detect too. But hey, you're always in control – feel free to provide this information yourself or even override our guesses.
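To give a feel for what "infer" can mean here, a sampling frequency can be estimated as the median gap between consecutive timestamps. This is only a sketch of the idea, not labda's actual inference algorithm:

```python
from datetime import datetime
from statistics import median

# Illustrative sketch only (not labda's actual code): estimate the
# sampling frequency as the median gap in seconds between consecutive
# records, which is robust to occasional dropouts.
timestamps = [
    datetime(2024, 5, 27, 13, 40, 30),
    datetime(2024, 5, 27, 13, 40, 45),
    datetime(2024, 5, 27, 13, 41, 15),  # one record was dropped here
    datetime(2024, 5, 27, 13, 41, 30),
]
gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
sampling_frequency = median(gaps)
print(sampling_frequency)  # 15.0
```

The median (rather than the mean) keeps a few missing records, like the 30-second gap above, from skewing the estimate.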
gps.df
motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | environment | |
---|---|---|---|---|---|---|---|---|
datetime | ||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 894158.625 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.681363 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.797084 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.967489 | 56.900002 | 0.0 | outdoor |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.393498 | 56.900002 | 0.0 | outdoor |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:45:15 | True | 7034513.632188 | 569673.198829 | 14.125 | 5.337112 | 52.0 | 1.039869 | outdoor |
2024-05-30 08:45:45 | True | 7034513.461384 | 569673.032835 | 24.632999 | 0.237777 | 53.100002 | 0.136467 | outdoor |
2024-05-30 08:46:00 | True | 7034513.426222 | 569672.953734 | 25.813 | 0.086378 | 52.100002 | 0.057855 | indoor |
2024-05-30 08:46:15 | True | 7034516.26338 | 569675.792054 | 26.905001 | 4.006353 | 52.600002 | 3.288865 | indoor |
2024-05-30 08:46:30 | True | 7034513.369759 | 569672.920023 | 9.0 | 4.070058 | 52.600002 | 0.781312 | outdoor |
10212 rows × 8 columns
There absolutely should be a guide to everything in this dataframe, but with only 27 hours a day, you know how it goes! We're working on describing all those columns and possible values, just give us a little extra time.
Note: The raw data wasn't originally time-aligned to whole seconds, but our parser intelligently handled this and formatted the data correctly, making it much easier for you to work with. We're also actively developing support for hertz units and plan to offer optional alignment to whole seconds, milliseconds, or hertz in the future. This will give you even more control over how your data is presented.
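Conceptually, aligning a timestamp to whole seconds is just rounding. Here's a minimal, hypothetical helper (not part of the labda package) showing the idea:

```python
from datetime import datetime, timedelta

def align_to_seconds(ts: datetime) -> datetime:
    """Round a timestamp to the nearest whole second.
    Hypothetical helper for illustration; not a labda function."""
    if ts.microsecond >= 500_000:
        ts += timedelta(seconds=1)  # round up past the half-second mark
    return ts.replace(microsecond=0)

aligned = align_to_seconds(datetime(2024, 5, 27, 13, 40, 29, 731042))
print(aligned)  # 2024-05-27 13:40:30
```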
Function help: Need a function refresher? No problem! In Jupyter notebooks, simply type the function name with a question mark (?) for a quick summary. Want more details? The documentation website has a complete rundown of the function's purpose and the specific inputs it requires. Be warned, though, the documentation might still be a work in progress.
from labda.parsers import Traccar
Traccar.from_server?
Signature:
Traccar.from_server(
    url: str,
    username: str,
    password: str,
    subject_id: str,
    *,
    start: datetime.datetime | None = None,
    end: datetime.datetime | None = None,
    sampling_frequency: float | None = None,
    crs: str | None = 'infer',
    timezone: str | None = 'infer',
    sensor_id: str | None = None,
    vendor: labda.structure.subject.Vendor = <Vendor.TRACCAR: 'Traccar'>,
    model: str | None = None,
    serial_number: str | None = None,
    firmware_version: str | None = None,
    accuracy_limit: int | float | None = None,
    environment_limit: int | float | None = None,
) -> labda.structure.subject.Subject

Docstring:
The Traccar API allows you to download data from a specific server (URL) using your login credentials (username and password). To obtain data for a particular device, you must specify a subject ID (which corresponds to the device identifier). Optionally, you can define a date and time range to download data for a specific timeframe. If no date range is provided (from/to), all data for the chosen device will be downloaded.

The downloaded data will be parsed and validated according to the schema defined in the structure module.

Args:
    url (str): The URL of the Traccar server.
    username (str): The username to authenticate with the server.
    password (str): The password to authenticate with the server.
    subject_id (str): The ID of the subject to fetch data for (corresponds to the device identifier).
    start (datetime, optional): The start time of the data to fetch. If not provided, it will fetch all available data.
    end (datetime, optional): The end time of the data to fetch. If not provided, it will fetch all available data.
    sampling_frequency (float, optional): The sampling frequency of the data. If not provided, it will be inferred from the data.
    timezone (str, optional): The timezone of the data. If set to "infer" it will be inferred from the data. If None it will be set to the local timezone. Otherwise, it will be set to the provided timezone.
    crs (str, optional): The coordinate reference system (CRS) of the data. If set to "infer" it will be inferred from the data. If None it will be set to "EPSG:4326". Otherwise, it will be set to the provided CRS.
    sensor_id (str, optional): The ID of the sensor to fetch data for. If not provided, the unique device ID from the Traccar server will be used (if available).
    vendor (Vendor, optional): The vendor of the device. Defaults to "Traccar".
    model (str, optional): The model of the device. If not provided, it will be fetched from the Traccar server (device - extra, phone + model) if available.
    serial_number (str, optional): The serial number of the device.
    firmware_version (str, optional): The firmware version of the device.
    accuracy_limit (int | float, optional): The maximum allowed GPS accuracy. If not provided, all points will be included; otherwise only points with accuracy lower than the provided value will be included.
    environment_limit (int | float, optional): The GPS accuracy threshold used to decide whether a point is indoor or outdoor. If a point's accuracy is lower than the provided value, the point will be marked as outdoor; otherwise it will be marked as indoor. If not provided, environment will not be detected.

Returns:
    Subject: A Subject object containing the fetched and processed dataframe containing information: datetime, latitude, longitude, gps_accuracy, distance, elevation, and speed.

Raises:
    HTTPError: HTTP request to the Traccar server failed.
    ValueError: No device is found for the specified subject ID.
    ValueError: No records are found for the specified device.
    ValueError: Environment limit must be lower than accuracy limit.

Examples:
    Here's how to call the function with just the minimum required parameters.

    ```python
    from labda.parsers import Traccar

    subject = Traccar.from_server(
        url="http://gps.example.com",
        username="admin",
        password="pwd9000",
        subject_id="john_doe",
    )
    ```

File: ~/projects/labda/labda/parsers/traccar.py
Type: function
1.2 Read Accelerometer File (SENS)¶
SENS offers a convenient way to gather physical activity data from large groups. This integrated system uses wireless accelerometers that automatically send data to secure cloud storage, and its ease of use makes it ideal for healthcare and research projects. SENS stores activity data in its own structure, which you can download directly from their web app. Once you do, our reader (parser) can easily translate it into our standardized format, making analyzing your activity data a breeze.
Parsing a SENS CSV file only requires the file path. However, specifying the desired output timezone is recommended; otherwise, timestamps will be in UTC.
Check out how LABDA tackles parsing CSV data files generated by SENS.
from labda.parsers import Sens
path = "data/sens/export_759DE5.csv" # Use the path to your data file. Check the data folder for examples.
timezone = gps.metadata.timezone # While our example files come in various time zones, the ideal approach is to automatically extract the correct timezone from your GPS data.
acl = Sens.from_csv(
    path=path,
    timezone=timezone,
)
2024-05-30 08:49:35 | INFO | labda.parsers.sens.from_csv | export_759DE5 | Parsed 56527 records (SF: 5.0s, TZ: Europe/Oslo) from: export_759DE5.csv (Sens).
Note: See those informative messages printed in color after each function runs? Those are actually our logs, keeping track of everything that happens as your data is processed! Although logging information is currently undergoing testing and might not capture everything flawlessly, it's designed to become a comprehensive record of function usage. We strive for extensive logging to guarantee transparency within automated data pipelines. This will empower researchers to follow their data's journey and pinpoint any problems during pipeline execution. In the future, these logs will be savable in structured formats like JSON, providing a convenient way to store and analyze logged data.
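To give a flavour of what structured JSON logs could look like, here's a small illustration using Python's standard logging module. This is just a sketch of the concept, not labda's actual logging setup:

```python
import json
import logging

# Illustration only (not labda's logging setup): a formatter that emits
# each log record as a JSON object, so a pipeline's logs can later be
# stored and queried like any other structured data.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("labda.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Parsed 56527 records")
# emits: {"level": "INFO", "name": "labda.example", "message": "Parsed 56527 records"}
```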
acl.metadata
Metadata(id='export_759DE5', sensor=[Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)], sampling_frequency=5.0, crs=None, timezone='Europe/Oslo')
acl.df
wear | position | steps | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|
datetime | ||||||
2024-05-27 00:00:05 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:10 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:15 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:20 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 00:00:25 | True | sitting-lying | 0 | sedentary | 0.89 | resting |
... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:37:10 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:15 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:20 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:25 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:37:30 | True | sitting-lying | 0 | sedentary | 0.0 | resting |
56527 rows × 6 columns
2. Merge Data from Both Sensors¶
Remember how we talked about the power of multiple sensors on Monday – more data, more insights, more fun! But with all that info, you might be wondering: is merging sensor data a total headache? Spoiler alert: absolutely not! It's actually quite achievable.
from labda import merge_subjects
subject = merge_subjects(gps, acl)
2024-05-30 08:49:44 | ERROR | labda.structure.merging._check_ids | 759DE5; export_759DE5 | IDs do not match (left: 759DE5, right: export_759DE5).
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 3
      1 from labda import merge_subjects
----> 3 subject = merge_subjects(gps, acl)

File ~/projects/labda/labda/structure/merging.py:145, in merge_subjects(left, right, how, **kwargs)
    136 logger.error(
    137     message,
    138     extra={
    (...)
    141     },
    142 )
    143 raise ValueError(message)
--> 145 id = _check_ids(left, right)
    146 sf = _check_sampling_frequency(left, right)
    147 tz = _check_timezones(left, right)

File ~/projects/labda/labda/structure/merging.py:48, in _check_ids(left, right)
     38 message = (
     39     f"IDs do not match (left: {left.metadata.id}, right: {right.metadata.id})."
     40 )
     41 logger.error(
     42     message,
     43     extra={
     (...)
     46     },
     47 )
---> 48 raise ValueError(message)
     50 return left.metadata.id

ValueError: IDs do not match (left: 759DE5, right: export_759DE5).
Yikes! Mismatched subject IDs! You might be wondering why this matters. Imagine a study with over 1000 participants wearing accelerometers and GPS trackers. When you want to combine their data by participant, matching IDs are crucial. Why? Because if IDs don't match, some data might get left behind. To avoid this, we need to check if IDs exist in both datasets. If not, we need to investigate and potentially adjust the data before merging.
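In a larger study you'd want to check ID overlap systematically before merging anything. Here's a minimal sketch using plain Python sets; the `export_` prefix rule mirrors the file names in our example data and is an assumption about how SENS names its exports:

```python
# Hypothetical sketch: reconciling subject IDs between two sensor
# exports before merging. The IDs and the "export_" naming convention
# are assumptions modelled on the workshop's example files.
gps_ids = {"759DE5", "8A01C2"}
acl_ids = {"export_759DE5", "export_8A01C2", "export_FF0000"}

def strip_prefix(sens_id: str) -> str:
    """Normalize a SENS export name to a bare subject ID."""
    return sens_id.removeprefix("export_")

normalized = {strip_prefix(i) for i in acl_ids}
both = gps_ids & normalized       # IDs safe to merge
acl_only = normalized - gps_ids   # accelerometer data without GPS
print(sorted(both), sorted(acl_only))  # ['759DE5', '8A01C2'] ['FF0000']
```

Any ID that shows up in only one of the sets should be investigated before you merge, so no participant's data silently gets left behind.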
If needed, you have the option to simply adjust the ID for one of the sensor datasets.
acl.metadata.id = "759DE5"
After fixing the IDs, just run the merging function again.
subject = merge_subjects(gps, acl)
2024-05-30 08:49:49 | ERROR | labda.structure.merging._check_sampling_frequency | 759DE5 | Sampling frequency mismatch (left: 15.0s, right: 5.0s).
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 subject = merge_subjects(gps, acl)

File ~/projects/labda/labda/structure/merging.py:146, in merge_subjects(left, right, how, **kwargs)
    143 raise ValueError(message)
    145 id = _check_ids(left, right)
--> 146 sf = _check_sampling_frequency(left, right)
    147 tz = _check_timezones(left, right)
    148 crs = _check_crs(left, right)

File ~/projects/labda/labda/structure/merging.py:63, in _check_sampling_frequency(left, right)
     55 message = f"Sampling frequency mismatch (left: {left.metadata.sampling_frequency}s, right: {right.metadata.sampling_frequency}s)."
     56 logger.error(
     57     message,
     58     extra={
     (...)
     61     },
     62 )
---> 63 raise ValueError(message)
     65 return left.metadata.sampling_frequency

ValueError: Sampling frequency mismatch (left: 15.0s, right: 5.0s).
Ugh, another hurdle! We've encountered mismatched sampling frequencies, meaning the data points weren't collected at the same intervals. In this case, the accelerometer data (collected every 5 seconds) needs to be downsampled to match the 15-second intervals of the GPS data.

To address this, we've got a handy function that automatically downsamples the higher-frequency data for you. It relies on a set of rules called a "mapper" that defines how to handle each data column; for example, the mapper might sum the number of steps but average the speed (from GPS). You have the flexibility to create your own custom mapper if needed.

It's important to remember that downsampling (aggregation) can introduce errors. The rule of thumb is that the higher the downsampling factor, the less accurate the data might become.
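The mapper idea can be sketched in a few lines of plain Python. This is an illustration of the concept (sum steps, average speed), not the package's actual implementation:

```python
# Minimal sketch of "mapper"-style downsampling (illustrative only,
# not labda's implementation): 5 s records are grouped into 15 s bins,
# steps are summed and speed is averaged, per the rules described above.
records = [  # (second offset, steps, speed) -- made-up example values
    (0, 2, 1.0), (5, 3, 2.0), (10, 1, 3.0),
    (15, 0, 0.0), (20, 4, 2.0), (25, 2, 1.0),
]

bins: dict[int, list[tuple[int, float]]] = {}
for t, steps, speed in records:
    bins.setdefault(t // 15 * 15, []).append((steps, speed))

downsampled = {
    t: (sum(s for s, _ in rows), sum(v for _, v in rows) / len(rows))
    for t, rows in bins.items()
}
print(downsampled)  # {0: (6, 2.0), 15: (6, 1.0)}
```

Note how summing preserves totals (steps) while averaging preserves rates (speed); picking the wrong aggregation for a column is exactly the kind of error a custom mapper lets you avoid.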
acl.downsample(15)
2024-05-30 08:49:56 | INFO | Subject.upsample | 759DE5 | Subject's data downsampled from 5.0s to 15s.
After downsampling, run the merging function once again.
subject = merge_subjects(gps, acl)
2024-05-30 08:49:59 | INFO | labda.structure.merging.merge_subjects | 759DE5 | Merged 10183 records (50, export_759DE5).
The merged data now includes metadata for both sensors, allowing you to easily identify the origin of the data.
subject.metadata.sensor
[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None), Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)]
subject.df
motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | environment | wear | position | steps | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 894158.625 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.681363 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.797084 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.967489 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.393498 | 56.900002 | 0.0 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 7034502.980077 | 569671.299592 | 48.216 | 7.782863 | 53.100002 | 0.025865 | indoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:36:30 | True | 7034508.325039 | 569660.845259 | 24.764 | 11.717518 | 52.100002 | 0.051499 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
2024-05-30 08:36:45 | True | 7034506.158289 | 569677.000718 | 44.108002 | 16.26247 | 52.600002 | 1.557108 | indoor | True | sitting-lying | 0 | sedentary | 0.01 | resting |
2024-05-30 08:37:00 | True | 7034501.701539 | 569667.868208 | 34.206001 | 10.140562 | 53.100002 | 1.767845 | indoor | True | sitting-lying | 0 | sedentary | 0.54 | resting |
2024-05-30 08:37:30 | True | 7034506.250361 | 569660.261525 | 24.555 | 8.845656 | 52.100002 | 0.045608 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting |
10183 rows × 14 columns
Note: The merge function is still under development, similar to other features. Currently, it offers limited options, focusing on an "inner join" approach. This means only data points present in both datasets are kept, and datasets cannot have identical columns to avoid duplicates. We understand the need for more advanced merging, particularly combining data from multiple identical sensors (like multiple accelerometers). This functionality is planned for future updates, allowing you to combine data from these sensors more effectively.
3. Expanding Your Data¶
Expand: Sensor data is powerful, but sometimes it needs a boost. Take GPS coordinates, for example. They hold a wealth of hidden information – distance traveled, speed, even acceleration. But extracting these insights can be a technical challenge. That's where our data expansion functions come in.
Recalculate: Sensor data is not immune to errors. Signal loss, hiccups, and freezes can introduce inaccuracies, especially in values like speed or distance. To address this, we offer the option to recalculate these values even if they already exist in a column. Just specify that you want to overwrite the existing data, and we'll ensure you have the most reliable data possible for your analysis.
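To make the recalculation concrete, here's the arithmetic for the first two GPS fixes from the table earlier, as a sketch rather than the package's implementation. With projected coordinates in metres (EPSG:32632), distance is planar Euclidean; the speed values in the dataframe are consistent with km/h, so we assume a 3.6 conversion factor:

```python
from math import hypot

# Illustrative sketch of the arithmetic behind add_distance/add_speed
# (not the package's implementation). Coordinates are projected metres
# (EPSG:32632), so distance is planar Euclidean; the dataframe's speed
# values appear to be km/h, hence the assumed 3.6 factor.
t0, y0, x0 = 0, 7034084.115065, 569292.445311   # first fix
t1, y1, x1 = 15, 7034083.578055, 569292.866324  # second fix, 15 s later

distance = hypot(x1 - x0, y1 - y0)          # metres
speed = distance / (t1 - t0) * 3.6          # km/h
print(round(distance, 3), round(speed, 3))  # 0.682 0.164
```

These match the `distance` and `speed` values in the second row of the merged dataframe above.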
subject.add_direction()
subject.add_timedelta()
subject.add_distance(overwrite=True)
subject.add_speed(overwrite=True)
subject.df
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_direction | 759DE5 | Direction column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_timedelta | 759DE5 | Timedelta column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_distance | 759DE5 | Distance column added.
2024-05-30 08:50:10 | INFO | labda.structure.subject.add_speed | 759DE5 | Speed column added.
motion | latitude | longitude | gnss_accuracy | elevation | environment | wear | position | steps | activity_intensity | activity_value | activity | direction | timedelta | distance | speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||||
2024-05-27 13:40:30 | False | 7034084.115065 | 569292.445311 | 11.484 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | <NA> | NaT | NaN | NaN |
2024-05-27 13:40:45 | False | 7034083.578055 | 569292.866324 | 11.498 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 330.66119 | 0 days 00:00:15 | 0.682372 | 0.163769 |
2024-05-27 13:41:15 | False | 7034083.283318 | 569292.123862 | 11.526 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 61.082443 | 0 days 00:00:30 | 0.798823 | 0.095859 |
2024-05-27 13:41:30 | False | 7034084.053956 | 569292.711229 | 11.496 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 209.072067 | 0 days 00:00:15 | 0.968959 | 0.232550 |
2024-05-27 13:42:00 | False | 7034082.678059 | 569292.481454 | 11.48 | 56.900002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 6.843455 | 0 days 00:00:30 | 1.394951 | 0.167394 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 7034502.980077 | 569671.299592 | 48.216 | 53.100002 | indoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 233.289403 | 0 days 00:00:30 | 7.800860 | 0.936103 |
2024-05-30 08:36:30 | True | 7034508.325039 | 569660.845259 | 24.764 | 52.100002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 202.855848 | 0 days 00:00:15 | 11.741452 | 2.817949 |
2024-05-30 08:36:45 | True | 7034506.158289 | 569677.000718 | 44.108002 | 52.600002 | indoor | True | sitting-lying | 0 | sedentary | 0.01 | resting | 73.518585 | 0 days 00:00:15 | 16.300113 | 3.912027 |
2024-05-30 08:37:00 | True | 7034501.701539 | 569667.868208 | 34.206001 | 53.100002 | indoor | True | sitting-lying | 0 | sedentary | 0.54 | resting | 329.253074 | 0 days 00:00:15 | 10.161956 | 2.438870 |
2024-05-30 08:37:30 | True | 7034506.250361 | 569660.261525 | 24.555 | 52.100002 | outdoor | True | sitting-lying | 0 | sedentary | 0.0 | resting | 198.332748 | 0 days 00:00:30 | 8.863036 | 1.063564 |
10183 rows × 16 columns
Note: The future is brimming with possibilities! We're overflowing with ideas for even more "amazing things" to integrate. Imagine effortlessly retrieving elevation data or unlocking weather information for any specific location and time – that's the power we're building together. We prioritize features that matter most to you, so tell us what you need!
4. Export and Import Files¶
We've covered a lot: reading various data formats, downsampling, merging, and even extracting new information. Now, to avoid redoing all that work, let's save your masterpiece!
Parquet is your friend here. This popular data format keeps files compact, loads them quickly, and lets you embed valuable metadata. Plus, Parquet integrates seamlessly with other languages such as R, Rust, and Go, as well as with data warehouses, making it a future-proof choice for your data ecosystem. Save your work with confidence, knowing it'll be readily accessible whenever and wherever you need it.
path = "data/subject.parquet"
subject.to_parquet(path, overwrite=True)
2024-05-30 08:50:16 | INFO | labda.structure.subject.to_parquet | 759DE5 | Subject exported: data/subject.parquet
You can easily load subject data back into the environment. This process is fast and efficient, and all loaded data is validated to ensure it meets the LABDA standard format.
from labda import Subject
path = "data/subject.parquet"
imported_subject = Subject.from_parquet(path)
2024-05-30 08:50:19 | INFO | labda.structure.subject.from_parquet | 759DE5 | Subject imported: data/subject.parquet
imported_subject.metadata
Metadata(id='759DE5', sensor=[Sensor(id='50', serial_number=None, model=None, vendor=<Vendor.TRACCAR: 'Traccar'>, firmware_version=None, extra=None), Sensor(id='export_759DE5', serial_number=None, model=None, vendor=<Vendor.SENS: 'Sens'>, firmware_version=None, extra=None)], sampling_frequency=15.0, crs='EPSG:32632', timezone='Europe/Oslo')
imported_subject.df
wear | timedelta | position | steps | motion | latitude | longitude | gnss_accuracy | distance | elevation | speed | direction | environment | activity_intensity | activity_value | activity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
datetime | ||||||||||||||||
2024-05-27 13:40:30 | True | NaT | sitting-lying | 0 | False | 7034084.115065 | 569292.445311 | 11.484 | <NA> | 56.900002 | <NA> | <NA> | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:40:45 | True | 0 days 00:00:15 | sitting-lying | 0 | False | 7034083.578055 | 569292.866324 | 11.498 | 0.682372 | 56.900002 | 0.163769 | 330.661194 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:41:15 | True | 0 days 00:00:30 | sitting-lying | 0 | False | 7034083.283318 | 569292.123862 | 11.526 | 0.798823 | 56.900002 | 0.095859 | 61.082443 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:41:30 | True | 0 days 00:00:15 | sitting-lying | 0 | False | 7034084.053956 | 569292.711229 | 11.496 | 0.968959 | 56.900002 | 0.23255 | 209.072067 | outdoor | sedentary | 0.0 | resting |
2024-05-27 13:42:00 | True | 0 days 00:00:30 | sitting-lying | 0 | False | 7034082.678059 | 569292.481454 | 11.48 | 1.394951 | 56.900002 | 0.167394 | 6.843455 | outdoor | sedentary | 0.0 | resting |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2024-05-30 08:36:15 | True | 0 days 00:00:30 | sitting-lying | 0 | True | 7034502.980077 | 569671.299592 | 48.216 | 7.80086 | 53.100002 | 0.936103 | 233.289398 | indoor | sedentary | 0.0 | resting |
2024-05-30 08:36:30 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034508.325039 | 569660.845259 | 24.764 | 11.741452 | 52.100002 | 2.817949 | 202.85585 | outdoor | sedentary | 0.0 | resting |
2024-05-30 08:36:45 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034506.158289 | 569677.000718 | 44.108002 | 16.300112 | 52.600002 | 3.912027 | 73.518585 | indoor | sedentary | 0.01 | resting |
2024-05-30 08:37:00 | True | 0 days 00:00:15 | sitting-lying | 0 | True | 7034501.701539 | 569667.868208 | 34.206001 | 10.161957 | 53.100002 | 2.438869 | 329.253082 | indoor | sedentary | 0.54 | resting |
2024-05-30 08:37:30 | True | 0 days 00:00:30 | sitting-lying | 0 | True | 7034506.250361 | 569660.261525 | 24.555 | 8.863036 | 52.100002 | 1.063564 | 198.332748 | outdoor | sedentary | 0.0 | resting |
10183 rows × 16 columns
5. Processing¶
With the datasets successfully merged and expanded, we can now proceed with the analysis phase to extract valuable insights.
5.1 Check Daily Wear Time¶
Before we take the next step and analyze your data, let's make sure it's ready! It's important to check whether you wore the sensors consistently throughout the academy program; complete sensor data gives us the most accurate picture of your progress. If there are any missing days, we can decide together how to handle them: we might need to exclude you from the analysis, or we can focus on the days with sensor data.
imported_subject.wear_time
wear_time | day | |
---|---|---|
datetime | ||
2024-05-27 | 0 days 06:18:15 | Monday |
2024-05-28 | 0 days 14:55:45 | Tuesday |
2024-05-29 | 0 days 14:39:00 | Wednesday |
2024-05-30 | 0 days 06:32:45 | Thursday |
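A common next step is applying an inclusion rule, e.g. keeping only days with at least some minimum wear time. Here's a sketch using the wear-time table above; the 10-hour threshold is an example value often seen in activity research, not a LABDA default:

```python
from datetime import timedelta

# Hypothetical sketch: applying an inclusion rule to the wear-time
# table above. The 10-hour threshold is an example value, not a
# LABDA default; pick a threshold appropriate for your study.
wear_time = {
    "2024-05-27": timedelta(hours=6, minutes=18, seconds=15),
    "2024-05-28": timedelta(hours=14, minutes=55, seconds=45),
    "2024-05-29": timedelta(hours=14, minutes=39),
    "2024-05-30": timedelta(hours=6, minutes=32, seconds=45),
}

threshold = timedelta(hours=10)
valid_days = [day for day, worn in wear_time.items() if worn >= threshold]
print(valid_days)  # ['2024-05-28', '2024-05-29']
```

Here Monday and Thursday fall short, which makes sense: the devices were handed out Monday afternoon and collected Thursday morning, so those are partial days rather than non-compliance.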
5.2 Detect Trips and Transportation Mode¶
Alright, let's take it one step at a time. First things first, let's visualize the GPS data on a map.
imported_subject.plot("gps")