About
The full process is already documented step-by-step in the attached project brief. READ IT.
You will be replicating the original workflow described there, but with one important adjustment:
The new version must account for part of speech (POS), using the parsed CoNLL-U files instead of counting lemmas only.
In other words, the counting logic must distinguish between lemma + POS combinations where relevant.
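As a rough illustration of lemma + POS counting, the sketch below tallies (lemma, UPOS) pairs from CoNLL-U text. It assumes the standard 10-column CoNLL-U layout, with the lemma in column 3 and the universal POS tag in column 4. The function name `count_lemma_pos` and the `skip_upos` filter are hypothetical and not taken from the project brief, which remains the authority on what should be counted.

```python
from collections import Counter


def count_lemma_pos(conllu_text, skip_upos=("PUNCT",)):
    """Count (lemma, UPOS) pairs in one CoNLL-U document.

    skip_upos is an assumption made for this sketch: the brief may
    define a different set of tags to exclude, or none at all.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, lemma, upos = cols[0], cols[2], cols[3]
        # Range IDs ("1-2") are multiword tokens whose syntactic words
        # follow on their own lines; decimal IDs ("1.1") are empty nodes.
        if "-" in token_id or "." in token_id:
            continue
        if upos in skip_upos:
            continue
        counts[(lemma, upos)] += 1
    return counts
```

Keying the counter on the pair rather than on the lemma alone is what keeps, for example, a noun and a verb that share a lemma as two separate entries.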
What Needs to Be Done:
Download and work with the latest OpenSubtitles2024 Hebrew corpus (parsed version).
Replicate the original pipeline:
1. Deduplication (single file per movie folder)
2. Lemma extraction
3. Frequency calculation (raw + normalized)
4. Range calculation
5. Dispersion (Gries' UDP)
6. Sorting by UDP
7. Export final TSV list
8. Modify the counting logic to use the parsed CoNLL-U files and include POS (instead of lemma-only counting)
9. Deliver a final frequency list of 4,000 to 5,000 items, sorted by UDP, including:
1/ Lemma
2/ Rank
3/ UDP
4/ Normalized frequency
5/ Range
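The frequency, range, dispersion and sorting steps could be sketched as follows. This is only an illustration under stated assumptions: it computes Gries' DP (deviation of proportions) per item and takes UDP to be 1 - DP, so that higher values mean more even dispersion and the list is sorted in descending order. The brief's own definition of UDP, its normalization base (per million is assumed here) and its sort direction take precedence. The function name `build_frequency_rows` and the row keys are hypothetical.

```python
from collections import Counter


def build_frequency_rows(doc_counts, per=1_000_000):
    """Build ranked rows from per-document Counters of (lemma, POS) pairs.

    doc_counts: one Counter per deduplicated document.
    Assumption for this sketch: UDP = 1 - DP, with DP being Gries'
    deviation of proportions; frequencies are normalized per `per` tokens.
    """
    # Drop empty documents so they do not distort the expected shares.
    docs = [c for c in doc_counts if sum(c.values()) > 0]
    sizes = [sum(c.values()) for c in docs]
    total = sum(sizes)
    if total == 0:
        return []
    expected = [s / total for s in sizes]  # each document's share of the corpus

    totals = Counter()
    for c in docs:
        totals.update(c)

    rows = []
    for item, freq in totals.items():
        # DP: half the summed absolute difference between the observed
        # share of the item in each document and that document's corpus share.
        dp = 0.5 * sum(
            abs(c.get(item, 0) / freq - s) for c, s in zip(docs, expected)
        )
        rows.append({
            "item": item,
            "udp": 1.0 - dp,
            "norm_freq": freq / total * per,
            "range": sum(1 for c in docs if c.get(item, 0) > 0),
        })

    # Most evenly dispersed first; ties broken by higher frequency.
    rows.sort(key=lambda r: (-r["udp"], -r["norm_freq"]))
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows
```

The inner loop visits every document for every item, which is fine for a sketch but would likely need a sparse or vectorized formulation on a corpus of this size. The resulting rows can be truncated to the requested length and written out with `csv.writer(..., delimiter="\t")`.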
You are not required to redesign the methodology - just correctly implement and adjust the documented process.
Important – To Apply
To ensure you have read and understood the brief:
1. Briefly describe (in your own words) how you would approach this project technically.
2. Explain specifically how you would modify the original lemma-counting logic to incorporate POS from CoNLL-U files.
3. Estimate how long this would take you.
4. Provide your fixed project price.
Applications that do not answer these points will not be considered.
Ideal Candidate
- Strong Python experience
- Comfortable working with large corpora
Experience with:
- CoNLL-U format
- NLP pipelines
- Parsing structured linguistic data
- Able to implement mathematical measures (e.g., dispersion / UDP)
This is a fixed-price project. Please apply with your proposed total price and timeline.
Contract duration: 1 to 3 months, at 30 hours per week.
Mandatory skills: Python Script, Data Mining, Natural Language Processing, Python, Data Processing, Data Cleaning
Language skills
- English
Notice to users
This listing comes from a TieTalent partner platform. Click "Apply now" to submit your application directly on their site.