vault backup: 2025-08-29 12:47:00

2025-08-29 12:47:00 -04:00
parent c1bdb06c6a
commit f831a58d53
6 changed files with 108 additions and 22 deletions
@@ -15,3 +15,48 @@ See [[portable-tools]] for valid dependencies.
 Right now I'm exporting Bluebeam markups to CSV before processing.
 However, if I converted the code to extract the markups directly with MuPDF.NET,
 as I've managed before with iText, that could save a step.
+
+## PDF Content Positional Tokenization
+
+Recursively parse and consume PDF vector content, replacing matched drawing content with positioned tokens.
+
+> [!example]
+> A span with text "GFCI" is consumed
+> (its draw calls are removed from the page content).
+> A `gfci_label` token is created
+> and encoded with the span's position.
+
+> [!example]
+> A `duplex_receptacle` token and a `gfci_label` token
+> in close proximity are consumed
+> creating a `duplex_gfci_receptacle` token
+> which inherits the `duplex_receptacle`'s position.
+
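Annotation (not part of the commit): the two examples above can be read as a rewrite rule applied until nothing more merges. The following is a minimal Python sketch of that idea, not the note author's implementation. The `Token` class, the `MERGE_RULES` table, the function names, and the `max_dist` threshold of 25 units are all hypothetical, and "close proximity" is assumed here to mean plain Euclidean distance.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Token:
    kind: str
    x: float
    y: float


# Hypothetical rule table: (anchor kind, modifier kind) -> merged kind.
MERGE_RULES = {("duplex_receptacle", "gfci_label"): "duplex_gfci_receptacle"}


def merge_pass(tokens, max_dist):
    """One pass: consume each anchor/modifier pair lying within max_dist."""
    consumed = set()
    merged = []
    for i, a in enumerate(tokens):
        if i in consumed:
            continue
        for j, b in enumerate(tokens):
            if j == i or j in consumed:
                continue
            new_kind = MERGE_RULES.get((a.kind, b.kind))
            if new_kind and hypot(a.x - b.x, a.y - b.y) <= max_dist:
                consumed.update((i, j))
                # The merged token inherits the anchor's position.
                merged.append(Token(new_kind, a.x, a.y))
                break
    survivors = [t for k, t in enumerate(tokens) if k not in consumed]
    return survivors + merged


def tokenize(tokens, max_dist=25.0):
    """Repeat merge passes until no rule fires (a fixed point)."""
    while True:
        nxt = merge_pass(tokens, max_dist)
        if len(nxt) == len(tokens):  # every merge shrinks the list by one
            return nxt
        tokens = nxt
```

Because merged tokens are fed back into the next pass, a rule whose anchor is itself a merged kind (for example a hypothetical `duplex_gfci_receptacle` plus a circuit label) would fire on a later pass without extra code.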
+```
+$ mutool show file.pdf pages/1/Contents
+
+629 0 obj
+<<
+  /Filter /FlateDecode
+  /Length 31375
+>>
+stream
+q
+0.12 0 0 0.12 0 0 cm
+/R8 gs
+/R9 gs
+2 w
+1 J
+1 j
+0 G
+q
+4217 3947 m
+4217 3879 l
+4467 3879 l
+4467 3947 l
+4455 3947 l
+4455 3891 l
+4230 3891 l
+4230 3947 l
+...
+```
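Annotation (not part of the commit): the dump above uses standard PDF content-stream operators, where operands come before the operator: `q` saves the graphics state, `cm` concatenates a matrix onto the current transformation matrix, `m` starts a subpath, and `l` appends a line segment. The Python sketch below walks a short excerpt of that dump and returns the path points in transformed coordinates. It assumes the stream is already decoded text, and it deliberately ignores strings, arrays, dictionaries, and inline images, so it is an illustration rather than a general parser. The names `parse_ops`, `concat`, and `path_points` are made up for this example.

```python
EXCERPT = """
q
0.12 0 0 0.12 0 0 cm
/R8 gs
2 w
q
4217 3947 m
4217 3879 l
4467 3879 l
4467 3947 l
"""


def parse_ops(stream):
    """Yield (operator, operands) pairs from whitespace-separated words."""
    operands = []
    for word in stream.split():
        if word.startswith("/"):
            operands.append(word)  # name operand, e.g. /R8
            continue
        try:
            operands.append(float(word))
        except ValueError:
            yield word, operands  # anything else is treated as an operator
            operands = []


def concat(m, ctm):
    """Return m x ctm for PDF matrices stored as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2,
    )


def path_points(stream):
    """Collect (op, x, y) for m/l operators, with the CTM applied."""
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    stack, points = [], []
    for op, args in parse_ops(stream):
        if op == "q":
            stack.append(ctm)
        elif op == "Q" and stack:
            ctm = stack.pop()
        elif op == "cm":
            ctm = concat(args, ctm)
        elif op in ("m", "l"):
            x, y = args
            a, b, c, d, e, f = ctm
            points.append((op, a * x + c * y + e, b * x + d * y + f))
    return points
```

With the `0.12` scale from the `cm` line, the first `m` operand pair `4217 3947` lands at roughly (506.04, 473.64) in page units. Positions like these are what a positional token would carry.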