# Seattle 2021 Prototype Dataset

This is a proof-of-concept dataset for testing the viability of
speaker annotation from audio classification.

## Dataset Structure

```
speakerbox/data/seattle-2021-proto/
├── annotations
│   ├── 23333c839436.json
│   ├── 32814deedec2.json
│   ├── 5e881a137b6d.json
│   ├── 79fb834ac65c.json
│   └── 7fe4c0d99b44.json
├── extra
│   └── ex-01.json
├── cleaner.py
├── metadata.json
└── README.md
```

### Annotations

Each file in the annotations directory is named with the active
CDP Seattle-Staging event id (as the event id is produced by `cdp-backend==v3.0.3`).

For example, to navigate to the CDP event page for one of the selected events,
fill in the id at the end of the following URL:

```
http://councildataproject.org/seattle-staging/#/events/{id}
```

Such as:

```
http://councildataproject.org/seattle-staging/#/events/23333c839436
```
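As a minimal sketch, the event page URL can be derived from an annotation filename, since each file is named with its event id. The helper name `event_url` and the constant `EVENT_URL_TEMPLATE` are illustrative and are not part of this dataset or of `cdp-backend`.

```python
from pathlib import Path

# URL pattern taken from this README; the constant name is illustrative.
EVENT_URL_TEMPLATE = "http://councildataproject.org/seattle-staging/#/events/{id}"


def event_url(annotation_path):
    """Return the CDP event page URL for an annotation file.

    Assumes the filename (without the .json suffix) is the event id.
    """
    event_id = Path(annotation_path).stem
    return EVENT_URL_TEMPLATE.format(id=event_id)


print(event_url("annotations/23333c839436.json"))
```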

Each annotation file was produced using the
[gecko annotation tool](https://gong-io.github.io/gecko/),
following this
[annotation procedure](https://docs.google.com/document/d/17lngsKv2mZGx6jKK8uXp5c3LbyAwi9iFPoOd3vhw6d8/edit?usp=sharing).

### Extras

Until we reprocess all of the city council meetings covered by the annotated data,
we will keep their annotations in the `extra` directory, separate from the files
in `annotations`.

### cleaner.py

Any updates or cleaning done to the annotation files will live in this Python script.

### metadata.json

This file will be used to store metadata about each annotation file that may be useful
for provenance information.

#### old-event-id

Currently this file stores a lookup of the `old-event-id` for each annotation file.
While this data was being annotated, we reprocessed many of our events because of a
[hashing bug that affected event id generation](https://github.com/CouncilDataProject/cdp-backend/pull/144).

When looking at the annotation procedure, if you see an event id that does not match
the filenames stored in this dataset, this lookup stores the appropriate match.
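The lookup could be used along the following lines. This is only a sketch: the layout of `metadata.json` is not documented here, so the code assumes a mapping from each current event id (the annotation filename without `.json`) to an object containing an `old-event-id` key. Check the actual file before relying on it. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Load metadata.json from disk into a dict."""
    return json.loads(Path(path).read_text())


def build_old_to_current(metadata):
    """Map each old event id to the current event id (annotation filename stem).

    ASSUMPTION: metadata looks like
        {"<current-event-id>": {"old-event-id": "<old-event-id>"}, ...}
    Entries without an "old-event-id" key are skipped.
    """
    return {
        entry["old-event-id"]: current_id
        for current_id, entry in metadata.items()
        if "old-event-id" in entry
    }
```

With the metadata loaded, `build_old_to_current(load_metadata("metadata.json")).get(old_id)` would return the matching annotation file id, or `None` if the id is not in the lookup.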
