
Re: [tlug] Open source license (wikipedia)
> > My real question, of course, is can I train a machine learning
> > model on that text data, and release it under a more liberal
> > license? Assuming the model is effectively a one-way hash, and
> > cannot reproduce the original data.
>
> It really depends on exactly what the model does.
I was lucky enough to be at an NLP conference last week where I asked
some people this same question, and got confident replies that what I
want to do is fine. Again, people were saying that the impossibility of
reconstructing the original is the key.
> > This is your litmus test. Can you reliably reconstruct the original
> > text? If so, it is a derivative work. If not, then it isn't.
>
> That's in the ballpark, but I'm pretty sure that's not the litmus
> test. The test is the reverse, ie, more like "if you know the
> original content, can you recognize something that probably has copied
> the expression of it?"
The models I have in mind pass that test too.
Word embeddings [1] that use multiword expressions or n-grams might be a
more interesting grey area when "n" is high enough (because the text for
each embedding is stored). (But I'll hazard a guess that n-grams up to
at least 4 or 5 are going to be okay.) ...oh, just realized: one-way
hashing of the n-gram text will still allow the embeddings to work, and
then it passes your other test too.
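To make that last point concrete, here's a minimal sketch (my own
illustration, not anyone's actual pipeline) of the idea: replace each
n-gram with a one-way hash before building the co-occurrence statistics
that embeddings are trained on. The statistics survive; the surface text
is never stored. Function names and the toy sentence are made up for the
example.

```python
import hashlib
from collections import defaultdict

def hashed_ngrams(tokens, n):
    """Replace each n-gram with a truncated SHA-256 digest.

    The raw text of the n-gram is never kept -- only its one-way hash,
    so the original expression cannot be read back out of the model.
    """
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        yield hashlib.sha256(gram.encode("utf-8")).hexdigest()[:16]

def cooccurrence(hashed_tokens, window=2):
    """Symmetric co-occurrence counts over the hashed tokens.

    This is the kind of statistic embedding training consumes; it works
    identically whether the keys are words or opaque hashes.
    """
    hashed_tokens = list(hashed_tokens)
    counts = defaultdict(int)
    for i, h in enumerate(hashed_tokens):
        lo = max(0, i - window)
        hi = min(len(hashed_tokens), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                counts[(h, hashed_tokens[j])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
grams = list(hashed_ngrams(tokens, 2))   # 8 hashed bigrams
counts = cooccurrence(grams)
```

Hashing is deterministic, so identical n-grams still map to the same
vocabulary entry and the distributional structure is preserved; what's
lost is exactly the ability to reconstruct the original wording.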
Darren
[1]: https://en.wikipedia.org/wiki/Word_embedding