**Greg Cocks** @GregCocks@techhub.social · Apr 11 *

The Diverse Mountainous Landslide Dataset (DMLD) - A High-Resolution Remote Sensing Landslide Dataset in Diverse Mountainous Regions
--
https://doi.org/10.3390/rs16111886 <-- shared paper
--
#GIS #spatial #mapping #landslide #dataset #detection #deeplearning #remotesensing #highresolution #earthobservation #massmovement #geology #engineeringgeology #risk #hazard #naturaldisaster #machinelearning #AI #mountain #mountainous #region #China #Yunnan #casestudy #geomorphlogy #geomorphometry #India #NewZealand #Kodagu #Kaikoura #cost #economics #infrastructure #publicsafety #spatialanalysis #model #modeling

**Dan Stowell** @danstowell@mastodon.social · Mar 20

New preprint: "InsectSet459: an open dataset of insect sounds for bioacoustic machine learning" https://arxiv.org/abs/2503.15074 #bioacoustics #dataset #insects #orthoptera #cicadidae

arXiv.org · InsectSet459: an open dataset of insect sounds for bioacoustic machine learning

Automatic recognition of insect sound could help us understand changing biodiversity trends around the world -- but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprised of 26399 audio files, from 459 species of Orthoptera and Cicadidae. It is the first large-scale dataset of insect sound that is easily applicable for developing novel deep-learning methods. Its recordings were made with a variety of audio recorders using varying sample rates to capture the extremely broad range of frequencies that insects produce. We benchmark performance with two state-of-the-art deep learning classifiers, demonstrating good performance but also significant room for improvement in acoustic insect classification. This dataset can serve as a realistic test case for implementing insect monitoring workflows, and as a challenging basis for the development of audio representation methods that can handle highly variable frequencies and/or sample rates.

**Christoph Deppe** @krycztof@mastodon.social · Mar 20

The #euvsdisinfo #dataset is a fantastic resource—well-structured, highly useful, and a great example of proactive #data sharing in the #research of #disinformation. I’ll be using it in an upcoming research project!

That said, the data collection #methodology remains a bit opaque. If anyone has insights on how the data is gathered, I’d love to hear more!

https://euvsdisinfo.eu/

EUvsDisinfo · Home page - EUvsDisinfo

EUvsDisinfo is the flagship project of the European External Action Service’s East StratCom Task Force. It was established in 2015 to better forecast, address, and respond to the Russian Federation’s ongoing disinformation campaigns affecting the European Union, its Member States, and countries in the shared neighbourhood.

#OSINT #DataTransparency #InfoOps

**Alexandre Dulaunoy** @a@paperbay.org · Feb 23

I was looking for a parseable Wiktionary dump and discovered Kaikki.org, a digital archive and data mining group. They offer a massive, parseable dataset in JSONL format.

https://kaikki.org/dictionary/rawdata.html

#opendata #opensource #wiktionary #dataset #datamining #ai #ml

@wikimediafoundation

kaikki.org · Raw data downloads extracted from Wiktionary
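
For anyone who has not worked with JSONL before: each line of the file is one standalone JSON object, so a large dump can be read line by line without loading it all into memory. Below is a minimal sketch of that idea. The field names `word` and `pos` and the sample records are assumptions made for illustration, not taken from the Kaikki.org documentation, and the in-memory `SAMPLE` stands in for a real downloaded file.

```python
import io
import json
from collections import Counter


def iter_jsonl(lines):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_field(lines, field):
    """Count how often each value of `field` occurs; entries lacking it are skipped."""
    counts = Counter()
    for entry in iter_jsonl(lines):
        if field in entry:
            counts[entry[field]] += 1
    return counts


# Hypothetical sample standing in for a downloaded dump file.
# The "word" and "pos" keys are assumed field names, used only for illustration.
SAMPLE = io.StringIO(
    '{"word": "cat", "pos": "noun"}\n'
    '{"word": "run", "pos": "verb"}\n'
    '\n'
    '{"word": "dog", "pos": "noun"}\n'
)

pos_counts = count_by_field(SAMPLE, "pos")
print(pos_counts.most_common())
```

With a real file, the same functions should work on an open file handle, for example `with open(path, encoding="utf-8") as f: count_by_field(f, "pos")`, where `path` is wherever the dump was saved.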

**Alexandre Dulaunoy** @adulau@infosec.exchange · Feb 22

We imported the data from Black Basta Ransomware group leak into AIL and there are many interesting aspects.

The federation network of Matrix servers (see the screenshot) used to communicate among the affiliates/group(s).
Activities in the chat room, especially the daily activity view in AIL. Guessing the location and timezone of groups or affiliates is an endless source of information.
They rely on many open-source and SaaS tools, including Google Docs or Zoom.
Many interesting correlations with cryptocurrencies, IP addresses, CVE numbers, and chat username relationships (who talks to whom and when).

If you are using the AIL project and want to import the leak dataset, @terrtia wrote an importer: https://github.com/ail-project/ail-feeder-matrix

#BlackBasta #blackbastleaks #threatintel #osint #threatintelligence #opensource #dataset

@ail_project

Maybe some interesting input for @fr0gger for his existing analysis.

I see that this dataset can be used to enhance some of our open-source tools.

https://github.com/ail-project/ail-framework

Lists of Matrix server references involved in the Black Basta ransomware group leak. The data has been imported to AIL.

**Jumping Langur** @jumpinglangur@mastodon.social · Feb 18

We just published a new virtual texture dataset for #GaiaSky: 32K Moon Topography (NASA SVS). Find out more here:
https://gaiasky.space/news/2025/moon-vt-32k

The new 32K moon topography dataset in action.

#Moon #VirtualTexture #Dataset

**openSUSE Linux** @opensuse@fosstodon.org · Feb 12

#AI meets #opensource licensing! The new #Cavil Legal Text #dataset helps automate #legal text classification, improves #compliance & transparency in #software. #openSUSE https://news.opensuse.org/2025/02/12/os-licensing-gets-ai-upgrade/

openSUSE News · Open-Source Licensing Gets AI Upgrade

Developers of the openSUSE community continue their commitment toward improving legal compliance and software transparency with the release of the Cavil Lega...

Continued thread

**FINOkoye** @FINOkoye@mastodon.social · Jan 19 *

OK ended up using OSF as I have quite a few things on tonight's checklist to get on with and tbh I don't think I really understand enough to understand any potential advice!

I have set up a data repository here. Would be lovely if someone, anyone out there could check that both the excel file and the csv can be viewed/downloaded!

https://osf.io/bjfhs/

OSF · Exploring the Black Atlantic

This spreadsheet contains data points used in the Future Legacies 'Black Atlantic' storymap. The storymap aims to educate readers about the concept of the Black Atlantic through items from different GLAM collections and Digital Humanities projects. Hosted on the Open Science Framework

#Academia #dataset

**FINOkoye** @FINOkoye@mastodon.social · Jan 19

#Academic #Dataset related question!

OK so I'm uploading a dataset to zenodo and it has a field called 'alternate identifier'. I don't really understand the answers from duckduckgo search - anyone able to give me a 101?

**Anand Philip** @anandphilipc@sigmoid.social · Jan 13 *

This dude https://shijith.com/ has amazing #dataviz and #dataset projects using public data from India. https://github.com/shijithpk/wikipedia_abuse_checker This is a script that collects the most abused Wikipedia pages. Here's one that looked at the panchayats in Kerala that have heated up the most in the last five years: https://github.com/shijithpk/hottest-panchayats-kerala He shares code and does good writeups on the blog. Good stuff! #DataVizIndia

**d'aïeux et d'ailleurs** @daieuxetdailleurs@framapiaf.org · Jan 11

RT @SandraBree 8 new datasets on the administrative history of the communes of France are available on #Progedo: https://data.progedo.fr/series/adisp/paroisses-et-communes-de-france

LinkedIn post announcing the release of 8 new datasets documenting the administrative and demographic history of the communes of France on Progedo, illustrated with a choropleth map of France and the cover of the published edition of "Paroisses et communes de France"

#dataSHS #histoire #France

**Greg Cocks** @GregCocks@techhub.social · Jan 7

Elevation-Derived Hydrography [EDH] - The USGS’s Rich New Hydrological Features Dataset
--
https://doi.org/10.2489/jswc.2024.0314A <-- shared paper
--
https://pubs.usgs.gov/publication/tm11B12 <-- USGS EDH Representation, Extraction, Attribution, and Delineation Rules reference publication
--
https://www.usgs.gov/3d-hydrography-program <-- shared link to the USGS 3DHP page
--
[in my role, I have the pleasure of working with the valuable EDH process(es) and the data it produces on a daily basis]
#GIS #spatial #mapping #water #hydrology #hydrography #3dep #edh #3dhp #elevationderivedhydrography #opendata #elevation #dem #dtm #interpretation #waterfeatures #usecase #waterresources #floodmodeling #alignment #model #modeling #dataset #naturalresources #costs #benefits #economics #businessuse #publicdata #spatialanalysis #USA #USGS
@USGS

map - Geomorphon index data overlaid on hillshade data for Elevation-Derived Hydrography production

maps - (a) Using contours and (b) geomorphon index to validate and refine stream location in Elevation-Derived Hydrography dataset.

maps - Progression of developing derived stream features from a Digital Elevation Model (left to right).

LiDAR base map / image - stream and road, false colour

Continued thread

**Dimitris Kontopoulos** @DGKontopoulos@ecoevo.social · Jan 6

To address 3 key questions regarding #torpor #evolution, we compiled a #dataset of a) torpor capabilities and b) 21 ecophysiological variables for 1,338 species of #mammals and #birds.

We then analysed this dataset using a series of #phylogenetic comparative methods.

5/12

**Christof Schöch** @christof@fedihum.org · Dec 18, 2024

The "Korean Journal of Digital Humanities" (under the aegis of #KADH) has just published a new issue! https://accesson.kr/kjdh/v.1/2/2024

Wide range of topics, including a short story #dataset, an analysis of #Fitzgerald, #newspaper, #networks, #morphology...

... and, intriguingly and curiously, an interview with myself.

accesson.kr · Korean Journal of Digital Humanities, Korean Association for Digital Humanities

Continued thread

**Digital Scholarship at the BL** @BL_DigiSchol@techhub.social · Dec 16, 2024

New dataset from BL colleague Alex Hailey: Ground truth transcriptions of 18th &19th century English language documents relating to botany from the India Office Records; related datacard and blog posts https://doi.org/10.23636/gfva-yn20 #Transkribus #dataset #GroundTruth #transcription

British Library · Ground truth transcriptions of 18th &19th century English language documents relating to botany from the India Office Records

**DW Innovation** @dw_innovation@mastodon.social · Dec 12, 2024

Have you heard about Common Corpus?

It's the largest open and permissively licensed text #dataset, comprising over 2 trillion (!) tokens.

CC is a diverse dataset that features books, newspaper and scientific articles, government and legal docs, code, and other assets.

https://huggingface.co/datasets/PleIAs/common_corpus

huggingface.co · PleIAs/common_corpus · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

**Kathy Reid** @KathyReid@aus.social · Dec 12, 2024

The Mozilla #CommonVoice #dataset v20 was released yesterday - the largest open #speech dataset in the world. My #dataviz, linked below, shows a continuation of patterns seen for some years now:

There's more data collected for #Catalan (ca) than for #English (en) - testament to the independence and language reclamation efforts in Catalunya. Language and cultural transmission are deeply intertwined.

Some of the newer #languages to Common Voice, like #Ligurian / #Genoese (lij) have contributions from mostly older speakers, which is unusual in comparison to the rest of the dataset. This may reflect the population that currently speak those languages - as many regional languages in Italy are in rapid decline.

Some languages such as Eastern Mari / Meadow Mari (mhr) - a #Uralic language spoken in the Mari-El Republic within Russia - have samples from predominantly female-identifying speakers, again contrasting to the rest of the dataset. Other languages here include #Cantonese (yue), #Georgian (ka), and #Kalenjin (kln).

A key part of preparing the Common Voice dataset is the validation of utterances to ensure they match their written transcription, which requires at least two validations by separate speakers. Some newer languages to Common Voice, such as Erzya (myv) and Moksha (mdf), both Uralic languages, have nearly 100% validation.

What are your interpretations of the dataset?

https://observablehq.com/@kathyreid/mozilla-common-voice-v20-dataset-metadata-coverage

Screenshot of Mozilla Common Voice v20 splits by gender as a horizontal bar chart

**Benjamin Carr, Ph.D.** @BenjaminHCCarr@hachyderm.io · Nov 16, 2024

How a stubborn #computerscientist accidentally launched the #deeplearning boom
"You’ve taken this idea way too far," a mentor told Prof. Fei-Fei Li, who was creating a new image #dataset that would be far larger than any that had come before: 14 million images, each labeled with one of nearly 22,000 categories. Then in 2012, a team from Univ of Toronto trained a #neural network, dubbed #AlexNet, on #ImageNet, achieving unprecedented performance in image recognition.
https://arstechnica.com/ai/2024/11/how-a-stubborn-computer-scientist-accidentally-launched-the-deep-learning-boom/ #AI

Ars Technica · Nov 11, 2024 · How a stubborn computer scientist accidentally launched the deep learning boom · By Timothy B. Lee

**Stefan Bohacek** @stefan@stefanbohacek.online · Oct 18, 2024 *

Synthesizers 1896 - 2024: A Dataset and Exploratory Insights by Iftah Gabbai: "an analysis of hardware synthesizers, samplers, and drum machines"

https://www.youtube.com/watch?v=Omj405hkOt0

Dataset: https://github.com/iftah-og/Synthesizers-1896-2024

via https://www.synthtopia.com/content/2024/10/17/synthesizers-1896-2024-a-dataset-and-exploratory-insights/

YouTube · Synthesizers 1896 - 2024: A Dataset and Exploratory Insights · By Iftah

#music #synths #data

**Tommi** @tommi@pan.rent · Oct 18, 2024

Hey #DataScience people!

I am about to start my first “Introduction to Data Science” course at #University, and our professor asked us to team up and think about a project that we want to do.

However, since I don’t know anything about the topic yet, I would really appreciate any tips for entry-level data science projects that I could do with #OpenData #DataSets in #Python!

Probably, we will be using #pandas. Since you’re here, any additional learning resources or general suggestions are much welcome, too!

Thanks

(Not sure how useful it is, but this is the course link: https://ois2.tlu.ee/tluois/subject/ULP6613-23265)

ois2.tlu.ee · TLÜ ÕIS

#DataScience101 #AI #learning
