Commit 6985914

final cleanup
1 parent c13a74d commit 6985914

2 files changed

+4
-4
lines changed

content/blog/2022-12-01-epidata-v4.Rmd

Lines changed: 2 additions & 2 deletions
@@ -40,10 +40,10 @@ This work was driven by the necessity to fix several specific deficiencies in th
 ### 2. Data Ingestion/Repair and Development

 * ``is_latest`` column:
-    + Because v3 recorded which row represented the most up-to-date value among a set of revisions using a flag, this flag had to be meticulously maintained to ensure that exactly one row for each ``(source,`` ``signal,`` ``time_type,`` ``time_value,`` ``geo_type,`` ``geo_value)`` tuple had the ``is_latest`` flag set. This flag had to be unset when newer data was added, but left in place when older data was patched. System faults while data was being loaded could leave the records in an inconsistent state where two or more, zero, or the wrong rows had the flag set. Detecting problems with this flag was computationally expensive.
+    + Because v3 recorded which row represented the most up-to-date value among a set of revisions using a flag, this flag had to be meticulously maintained to ensure that exactly one row for each (``source``, ``signal``, ``time_type``, ``time_value``, ``geo_type``, ``geo_value``) tuple had the ``is_latest`` flag set. This flag had to be unset when newer data was added, but left in place when older data was patched. System faults while data was being loaded could leave the records in an inconsistent state where two or more, zero, or the wrong rows had the flag set. Detecting problems with this flag was computationally expensive.

 * Reduce repetition/normalize
-    + The v3 table included the full text of the source, signal, geographic type, and geographic value for each row, such that we were storing millions of copies of identical information for rows that shared the same ``(source,`` ``signal,`` ``geo_type,`` ``geo_value)``. Since the duplicated information was also the most common selection criteria, the indexes took up a disproportionate amount of space on disk relative to the size of the base dataset.
+    + The v3 table included the full text of the source, signal, geographic type, and geographic value for each row, such that we were storing millions of copies of identical information for rows that shared the same (``source``, ``signal``, ``geo_type``, ``geo_value``). Since the duplicated information was also the most common selection criteria, the indexes took up a disproportionate amount of space on disk relative to the size of the base dataset.

 ### 3. Systems Maintenance
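
Reviewer note on the ``is_latest`` bullet in this hunk: the invariant it describes (exactly one flagged row per key tuple, and that row being the newest revision) can be checked with a grouped scan. The following is only a minimal sketch using Python's standard `sqlite3` module. The table name `readings`, the `issue` column used as a revision number, and the demo rows are all assumptions made for illustration; they are not taken from the v3 schema. Because the check has to group every row in the table, it also hints at why detection was described as expensive.

```python
import sqlite3

# Key columns as named in the blog text. The table name `readings` and the
# `issue` (revision number) column are hypothetical, chosen for this sketch.
KEY = "source, signal, time_type, time_value, geo_type, geo_value"


def make_demo_db():
    """Build an in-memory table with one healthy key and three broken ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE readings ({KEY}, issue INTEGER, value REAL, is_latest INTEGER)"
    )
    base = ("src", "sig", "day", 20221201, "state")
    rows = [
        base + ("pa", 1, 1.0, 0), base + ("pa", 2, 1.5, 1),  # consistent
        base + ("ny", 1, 2.0, 1), base + ("ny", 2, 2.5, 1),  # two rows flagged
        base + ("tx", 1, 3.0, 0), base + ("tx", 2, 3.5, 0),  # no row flagged
        base + ("ca", 1, 4.0, 1), base + ("ca", 2, 4.5, 0),  # wrong row flagged
    ]
    conn.executemany("INSERT INTO readings VALUES (?,?,?,?,?,?,?,?,?)", rows)
    return conn


def find_flag_violations(conn):
    """Return (geo_value, flagged_count) for every key with a bad flag state."""
    query = f"""
        SELECT geo_value, SUM(is_latest)
        FROM readings
        GROUP BY {KEY}
        HAVING SUM(is_latest) != 1
            OR MAX(CASE WHEN is_latest = 1 THEN issue END) != MAX(issue)
        ORDER BY geo_value
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for geo_value, flagged in find_flag_violations(make_demo_db()):
        print(geo_value, flagged)
```

In the demo data, `ca`, `ny`, and `tx` are reported, covering the "wrong row", "two or more", and "zero" cases from the text, while `pa` passes.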

content/blog/2022-12-01-epidata-v4.html

Lines changed: 2 additions & 2 deletions
@@ -60,11 +60,11 @@ <h3>2. Data Ingestion/Repair and Development</h3>
 <ul>
 <li><code>is_latest</code> column:
 <ul>
-<li>Because v3 recorded which row represented the most up-to-date value among a set of revisions using a flag, this flag had to be meticulously maintained to ensure that exactly one row for each <code>(source,</code> <code>signal,</code> <code>time_type,</code> <code>time_value,</code> <code>geo_type,</code> <code>geo_value)</code> tuple had the <code>is_latest</code> flag set. This flag had to be unset when newer data was added, but left in place when older data was patched. System faults while data was being loaded could leave the records in an inconsistent state where two or more, zero, or the wrong rows had the flag set. Detecting problems with this flag was computationally expensive.</li>
+<li>Because v3 recorded which row represented the most up-to-date value among a set of revisions using a flag, this flag had to be meticulously maintained to ensure that exactly one row for each (<code>source</code>, <code>signal</code>, <code>time_type</code>, <code>time_value</code>, <code>geo_type</code>, <code>geo_value</code>) tuple had the <code>is_latest</code> flag set. This flag had to be unset when newer data was added, but left in place when older data was patched. System faults while data was being loaded could leave the records in an inconsistent state where two or more, zero, or the wrong rows had the flag set. Detecting problems with this flag was computationally expensive.</li>
 </ul></li>
 <li>Reduce repetition/normalize
 <ul>
-<li>The v3 table included the full text of the source, signal, geographic type, and geographic value for each row, such that we were storing millions of copies of identical information for rows that shared the same <code>(source,</code> <code>signal,</code> <code>geo_type,</code> <code>geo_value)</code>. Since the duplicated information was also the most common selection criteria, the indexes took up a disproportionate amount of space on disk relative to the size of the base dataset.</li>
+<li>The v3 table included the full text of the source, signal, geographic type, and geographic value for each row, such that we were storing millions of copies of identical information for rows that shared the same (<code>source</code>, <code>signal</code>, <code>geo_type</code>, <code>geo_value</code>). Since the duplicated information was also the most common selection criteria, the indexes took up a disproportionate amount of space on disk relative to the size of the base dataset.</li>
 </ul></li>
 </ul>
 </div>
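
Reviewer note on the "Reduce repetition/normalize" bullet: the repeated (`source`, `signal`, `geo_type`, `geo_value`) text can be factored into a small key table so that each data row stores only an integer reference. The sketch below shows one generic way to do that with Python's standard `sqlite3` module. The table names `key_dim` and `fact`, the column `key_id`, and the demo rows are hypothetical; this is not claimed to be the schema the post's authors adopted.

```python
import sqlite3

# Hypothetical denormalized rows:
# (source, signal, geo_type, geo_value, time_value, value)
DEMO_ROWS = [
    ("src", "sig", "state", "pa", 20221201, 1.0),
    ("src", "sig", "state", "pa", 20221202, 1.5),
    ("src", "sig", "state", "pa", 20221203, 2.0),
    ("src", "sig", "state", "ny", 20221201, 3.0),
]


def normalize(rows):
    """Load rows into a key table plus a compact fact table; return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE key_dim (
            key_id INTEGER PRIMARY KEY,
            source TEXT, signal TEXT, geo_type TEXT, geo_value TEXT,
            UNIQUE (source, signal, geo_type, geo_value)
        );
        CREATE TABLE fact (
            key_id INTEGER REFERENCES key_dim (key_id),
            time_value INTEGER,
            value REAL
        );
        """
    )
    for source, signal, geo_type, geo_value, time_value, value in rows:
        key = (source, signal, geo_type, geo_value)
        # Store each distinct key text once; repeats are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO key_dim (source, signal, geo_type, geo_value) "
            "VALUES (?, ?, ?, ?)",
            key,
        )
        (key_id,) = conn.execute(
            "SELECT key_id FROM key_dim "
            "WHERE source = ? AND signal = ? AND geo_type = ? AND geo_value = ?",
            key,
        ).fetchone()
        # The fact row carries only the small integer reference.
        conn.execute("INSERT INTO fact VALUES (?, ?, ?)", (key_id, time_value, value))
    return conn


if __name__ == "__main__":
    conn = normalize(DEMO_ROWS)
    print(conn.execute("SELECT COUNT(*) FROM key_dim").fetchone()[0], "distinct keys")
    print(conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0], "fact rows")
```

With this layout, indexes on the fact table cover a single integer column instead of four text columns, which addresses the index-size concern the bullet raises.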
