<
From version < 10.1 >
edited by Andreas Isking
on 2025/02/28 18:15
To version < 9.1 >
edited by AISOP Admin
on 2025/01/14 21:23
>
Change comment: There is no comment for this version

Summary

Details

Page properties
Author
... ... @@ -1,1 +1,1 @@
1 -XWiki.andisk
1 +XWiki.AISOPAdmin
Content
... ... @@ -57,97 +57,6 @@
57 57  * At a minimum, extract the complete set of slides and their comments. We also recommend using past students' e-portfolios. In our experience, about 1000 fragments per course work rather well.
58 58  * If interrupted, the process may create several JSON files. You can combine them using the [[merge-tool>>https://gitlab.com/aisop/aisop-hacking/-/tree/main/merge-json-files]].
59 59  
60 -
61 -==== 1.2.1: Extract Text from PDF or PNG (PDF → PNG → Text) ====
62 -
63 -Text extraction from PDFs is sometimes faulty, and many PDFs contain text only inside images. The OCR engine Tesseract can capture this text. This section briefly explains how to use it.
64 -
65 -**Prerequisites**
66 -
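67 -~1. Tesseract must be installed, including the German language data (##deu##) that the script below requests with ##-l deu##. Verify the installation with: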
68 -
69 -(% class="box" %)
70 -(((
71 -##tesseract ~-~-version##
72 -)))
73 -
74 -2. Poppler must be installed, since it provides ##pdftoppm##. With Homebrew, for example:
75 -
76 -(% class="box" %)
77 -(((
78 -##brew install poppler##
79 -)))
80 -
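Both prerequisites can also be checked from a script before starting a long OCR run. The following is a minimal sketch and not part of the original instructions: the function names ##require_cmd## and ##check_ocr_prerequisites## are hypothetical, and the check for ##deu## assumes the German language data used by the script in this section.

```shell
# Hypothetical helper: succeeds if the given command is found on the PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "Missing prerequisite: $1" >&2
    return 1
  fi
}

# Hypothetical check for both tools and the German language data.
check_ocr_prerequisites() {
  require_cmd tesseract || return 1
  require_cmd pdftoppm || return 1   # pdftoppm is part of Poppler
  # Older Tesseract versions print the language list to stderr, so merge both streams.
  if ! tesseract --list-langs 2>&1 | grep -qx 'deu'; then
    echo "Tesseract language data 'deu' is not installed" >&2
    return 1
  fi
}
```

Calling ##check_ocr_prerequisites## before the conversion loop stops early with a message instead of failing in the middle of a batch.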
81 -**Code**
82 -
83 -(% class="box" %)
84 -(((
85 -##for pdf in *.pdf; do
86 -# Extract the base name of the PDF without the extension
87 -basename="${pdf%.pdf}"
88 -\\# Convert PDF to PNGs
89 -pdftoppm -png "$pdf" "$basename"
90 -\\# Create a text file with the same name as the PDF.
91 -for png in "$basename"-*.png; do
92 - tesseract "$png" stdout -l deu ~-~-oem 1 | tr '~\~\n' ' ' | sed 's/  */ /g' >> "$basename.txt"
93 - echo -e "~\~\n~\~\n~-~--~\~\n~\~\n" >> "$basename.txt"
94 -done
95 -rm -f "$basename"-*.png
96 -done##
97 -)))
98 -
99 -**Explanation**
100 -
101 -~1. ##for pdf in *.pdf; do ...; done##
102 -
103 -• Loops through all PDF files in the directory.
104 -
105 -2. ##basename="${pdf%.pdf}"##
106 -
107 -• Extracts the filename of the PDF without the .pdf extension.
108 -
109 -3. ##pdftoppm -png "$pdf" "$basename"##
110 -
110 -• Converts the PDF into PNG images, one per page, named in the format BASENAME-1.png, BASENAME-2.png, etc. (page numbers may be zero-padded for longer documents).
112 -
113 -4. ##for png in "$basename"-*.png; do ...; done##
114 -
115 -• Processes only the PNG files generated from the current PDF.
116 -
117 -5. ##tesseract "$png" stdout -l deu ~-~-oem 1##
118 -
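119 -• Performs OCR on the PNG file with the German language data (##-l deu##) and the LSTM engine (##~-~-oem 1##), and writes the recognized text to standard output.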
120 -
121 -6. ##tr '\n' ' '##
122 -
123 -• Replaces line breaks with spaces.
124 -
125 -7. ##sed 's/  */ /g'##
126 -
127 -• Reduces multiple spaces to a single space.
128 -
129 -8. ##>> "$basename.txt"##
130 -
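131 -• Appends the recognized text to a text file with the same base name as the PDF. Because the text is appended, delete an existing text file before re-running the script to avoid duplicated content.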
132 -
133 -9. ##echo -e "\n\n~-~--\n\n" >> "$basename.txt"##
134 -
135 -• Adds a separator line (~-~--) after each page.
136 -
137 -**Result**
138 -
139 -• A separate text file is created for each PDF, e.g.:
140 -
141 -• file1.txt for file1.pdf
142 -
143 -• file2.txt for file2.pdf
144 -
145 -• The OCR results of all pages from the respective PDF are written into this text file.
146 -
147 -• Each page is separated by a separator line (~-~--).
148 -
149 -• Temporary PNG files are deleted at the end.
150 -
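The same pipeline can be wrapped in a function that renders the pages into a private temporary directory, so that no PNG files in the working directory are touched. This is a sketch under assumptions, not the procedure documented above: the names ##normalize_ws## and ##ocr_pdf## are hypothetical, the language defaults to ##deu## as in the script above, and the output file is emptied first so that re-runs do not append duplicates.

```shell
# Collapse line breaks and runs of spaces into single spaces (reads stdin).
normalize_ws() {
  tr '\n' ' ' | sed 's/  */ /g'
}

# Hypothetical wrapper: OCR one PDF into <basename>.txt.
# Assumes tesseract and pdftoppm are installed; language defaults to German.
# Note: plain POSIX sh has no local variables, so these names are global.
ocr_pdf() {
  pdf="$1"
  lang="${2:-deu}"
  out="${pdf%.pdf}.txt"
  tmpdir="$(mktemp -d)" || return 1

  # Render the pages into the private temporary directory.
  pdftoppm -png "$pdf" "$tmpdir/page" || { rm -rf "$tmpdir"; return 1; }

  : > "$out"   # start from an empty file so re-runs do not append duplicates
  for png in "$tmpdir"/page-*.png; do
    tesseract "$png" stdout -l "$lang" --oem 1 | normalize_ws >> "$out"
    printf '\n\n---\n\n' >> "$out"   # page separator
  done

  rm -rf "$tmpdir"   # removes only the PNGs created by this call
}
```

A loop such as ##for pdf in *.pdf; do ocr_pdf "$pdf"; done## would then process every PDF in the directory. A second argument, for example ##eng##, selects a different language if its data is installed.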
151 151  === 1.3: Annotate Text Fragments ===
152 152  
153 153  It is time to assign topics to the fragments so that we can recognize the topics of students' paragraphs. In AISOP, we have used the (commercial) tool [[prodigy>>https://prodi.gy/]] for this task in two steps, both of which iterate through all fragments to assign them topics.
