From version < 10.1 > edited by Andreas Isking on 2025/02/28 18:15
To version < 8.1 > edited by AISOP Admin on 2025/01/14 21:22
Change comment: There is no comment for this version

Summary

Details

Page properties
Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.andisk
1 +XWiki.AISOPAdmin
Content
... ... @@ -57,97 +57,6 @@
57 57  * The least amount of content to be extracted is the complete set of slides and their comments. We also recommend using past students' e-portfolios. We have had rather good experience with about 1000 fragments for a course.
58 58  * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]]; a plain-shell alternative is sketched below.
59 59  
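If you just need a quick local merge and each partial file contains a JSON array of fragments, ##jq## can concatenate the arrays. This is only a minimal sketch under those assumptions (the file names are placeholders); the merge-tool linked above remains the recommended way.

(% class="box" %)
(((
##jq -s 'add' fragments-part1.json fragments-part2.json > fragments-merged.json
# file names are placeholders: -s (~-~-slurp) collects the input arrays, 'add' concatenates them##
)))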
60 -
61 -==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 -
63 -Text extraction from PDFs is sometimes faulty, and many PDFs contain images whose text is not extracted at all. To capture this text, the OCR tool Tesseract can be used. A brief explanation of how to use it follows.
64 -
65 -**Prerequisites**
66 -
67 -~1. Tesseract must be installed (verify with):
68 -
69 -(% class="box" %)
70 -(((
71 -##tesseract ~-~-version##
72 -)))
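If the version check fails, Tesseract can typically be installed via Homebrew on macOS (a hedged hint, assuming Homebrew is available; the ##tesseract-lang## formula adds additional language data such as German, which is used below):

(% class="box" %)
(((
##brew install tesseract
# optional: additional language data such as German
brew install tesseract-lang##
)))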
73 -
74 -2. Poppler must be installed (on macOS, e.g. via Homebrew):
75 -
76 -(% class="box" %)
77 -(((
78 -##brew install poppler##
79 -)))
80 -
81 -**Code**
82 -
83 -(% class="box" %)
84 -(((
85 -##for pdf in *.pdf; do
86 -# Extract the base name of the PDF without the extension
87 -basename="${pdf%.pdf}"
88 -# Convert the PDF into one PNG per page
89 -pdftoppm -png "$pdf" "$basename"
90 -# Run OCR on each page and append the text to a file named after the PDF
91 -for png in "$basename"-*.png; do
92 -  tesseract "$png" stdout -l deu ~-~-oem 1 | tr '\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 -  echo -e "\n\n~-~--\n\n" >> "$basename.txt"
94 -done
95 -done
96 -rm *.png##
97 -)))
98 -
99 -**Explanation**
100 -
101 -~1. ##for pdf in *.pdf; do ...; done##
102 -
103 -• Loops through all PDF files in the directory.
104 -
105 -2. ##basename="${pdf%.pdf}"##
106 -
107 -• Extracts the filename of the PDF without the .pdf extension.
108 -
109 -3. ##pdftoppm -png "$pdf" "$basename"##
110 -
111 -• Converts the PDF into PNG images, named in the format BASENAME-1.png, BASENAME-2.png, etc.
112 -
113 -4. ##for png in "$basename"-*.png; do ...; done##
114 -
115 -• Processes only the PNG files generated from the current PDF.
116 -
117 -5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 -
119 -• Performs OCR on the PNG file.
120 -
121 -6. ##tr '\n' ' '##
122 -
123 -• Replaces line breaks with spaces.
124 -
125 -7. ##sed 's/  */ /g'##
126 -
127 -• Reduces multiple spaces to a single space.
128 -
129 -8. ##>> "$basename.txt"##
130 -
131 -• Appends the recognized text to a text file with the same name as the PDF.
132 -
133 -9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 -
135 -• Adds a separator line (~-~--) after each page.
136 -
137 -**Result**
138 -
139 -• A separate text file is created for each PDF, e.g.:
140 -
141 -• file1.txt for file1.pdf
142 -
143 -• file2.txt for file2.pdf
144 -
145 -• The OCR results of all pages from the respective PDF are written into this text file.
146 -
147 -• Each page is separated by a separator line (~-~--).
148 -
149 -• Temporary PNG files are deleted at the end.
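As a quick sanity check, the page separators can be counted per text file; the count should match the number of pages of the corresponding PDF. A minimal sketch, with ##file1.txt## as a placeholder:

(% class="box" %)
(((
##grep -c '^~-~--$' file1.txt
# counts lines consisting only of the separator, i.e. one per OCR'd page##
)))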
150 -
151 151  === 1.3: Annotate Text Fragments ===
152 152  
153 153  It is time to annotate the fragments with topics so that we can later recognize the topics of students' paragraphs. In AISOP, we have used the (commercial) [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to assign topics.
... ... @@ -155,8 +155,7 @@
155 155  **The first step: top-level labels:** This is the simple [["text classifier" recipe>>https://prodi.gy/docs/recipes#textcat]] of prodigy. Invoke ##prodigy textcat.manual the-course-name-l1 ./fragments.jsonl ~-~-label labels-depth1.txt##, which offers a web interface on which each fragment is annotated with its (top-level) label. This web interface can be left running for several days.
156 156  Then extract the content into a file: ##prodigy db-out the-course-name-l1 > the-course-name-dbout.jsonl##
157 157  
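The ##labels-depth1.txt## file passed via ##~-~-label## typically contains one top-level label per line. The entries below are only placeholders; use the top-level topics of your own course:

(% class="box" %)
(((
##topic-a
topic-b
topic-c##
)))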
158 -**The second step is the hierarchical annotation** [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): The same fragments are now annotated with the top-level annotation and all their children. E.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 \
159 - the-course-name-dbout.jsonl labels-all-depths.txt  -F ./subcat_annotate_with_top2.py##.
67 +**The second step is the hierarchical annotation** with a [[custom recipe>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/hierarchical_annotation?ref_type=heads]] (link to become public soon): the same fragments are now annotated with the top-level label and all of its children, e.g. using the command ##python -m prodigy subcat_annotate_with_top2 the-course-name-l2 the-course-name-dbout.jsonl labels-all-depths.txt -F ./subcat_annotate_with_top2.py##.
160 160  
161 161  The resulting dataset can be extracted from prodigy using the db-out recipe, e.g. ##prodigy db-out the-course-name-l2 the-course-name-l2-dbout##, or converted to a spaCy dataset for training, e.g. using the command xxxxx (see [[here>>https://gitlab.com/aisop/aisop-nlp/-/tree/main/it3/fundamental-principles]]).
162 162  
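Before training, it can be useful to eyeball one exported record. A minimal sketch, assuming ##jq## is installed and reusing the dataset name from above:

(% class="box" %)
(((
##prodigy db-out the-course-name-l2 > the-course-name-l2-dbout.jsonl
# pretty-print the first annotated fragment
head -n 1 the-course-name-l2-dbout.jsonl | jq .##
)))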
