Mailing List Archive



Re: [tlug] Open source license (wikipedia)



>  > My real question, of course, is can I train a machine learning
>  > model on that text data, and release it under a more liberal
>  > license? Assuming the model is effectively a one-way hash, and
>  > cannot reproduce the original data.
> 
> It really depends on exactly what the model does.

I was lucky enough to be at an NLP conference last week where I asked
some people this same question, and got confident replies that what I
want to do is fine. Again, people were saying that the impossibility of
reconstructing the original is the key.

>  > This is your litmus test. Can you reliably reconstruct the original
>  > text? If so, it is a derivative work.  If not, then it isn't.
> 
> That's in the ballpark, but I'm pretty sure that's not the litmus
> test.  The test is the reverse, ie, more like "if you know the
> original content, can you recognize something that probably has copied
> the expression of it?"

The models I have in mind pass that test too.

Word embeddings [1] that use multiword expressions or n-grams might be a
more interesting grey area when "n" is high enough (because the text for
each embedding is stored).  (But I'll hazard a guess that n-grams up to
at least 4 or 5 are going to be okay.) ...oh, I just realized: 1-way
hashing of the n-gram text will still allow the embeddings to work, and
then it passes your other test too.
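To make the hashing idea concrete, here's a minimal sketch (all names
hypothetical, and a plain dict standing in for a real embedding table):
key each n-gram's vector by a one-way hash of its text, so lookups still
work but the stored keys can't be inverted to recover the original words.

```python
import hashlib

def ngram_key(tokens):
    # One-way hash of the n-gram text; the digest alone cannot be
    # inverted to recover the original words.
    text = " ".join(tokens)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical embedding table keyed by hashed n-grams instead of raw text.
embeddings = {}
embeddings[ngram_key(["free", "software", "license"])] = [0.1, -0.3, 0.7]

# Lookup still works: hash the query n-gram and fetch its vector.
vec = embeddings.get(ngram_key(["free", "software", "license"]))

# The stored keys contain no recoverable text.
leaks = any("free" in k or "license" in k for k in embeddings)
```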

Darren

[1]: https://en.wikipedia.org/wiki/Word_embedding
