How do I convert an XML file to a nice pandas DataFrame?
Suppose I have the following XML:
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
<documents count="N">
<document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
</document>
</documents>
</author>
I would like to read this XML file and convert it to a pandas DataFrame like this:
key type language feature web data
e95324a9a6c790ecb95e46cf15bE232ee517651 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
e95324a9a6c790ecb95e46cf15bE232ee517651 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
19e71144c50a8b9160b3cvdf2324f0955e906fce XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
21d4af9021a174f61b8erf284606c74d9e42 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
28a45eb2460823499763d70vdf9ca00ddbb665 XXX EN xx www.foo_bar_exmaple.com A large text with lots of strings and punctuations symbols [...]
This is what I have tried so far, but I get errors, and there is probably a more efficient way to do this:
from lxml import objectify
import pandas as pd
path = 'file_path'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('key','type', 'language', 'feature', 'web', 'data'))
for i in range(0,len(xml)):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['key','type', 'language', 'feature', 'web', 'data'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)
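Can anyone suggest a better approach to this problem?
You can easily use xml (from the Python standard library) to convert to a pandas.DataFrame. Here is what I would do (when reading from a file, replace xml_data with the name of your file or a file object):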
import pandas as pd
import xml.etree.ElementTree as ET
import io
def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict
xml_data = io.StringIO(u'''\
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
<documents count="N">
<document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
</document>
<document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
</document>
</documents>
</author>
''')
etree = ET.parse(xml_data) #create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
If there are multiple authors in your original document, or the root of your XML is not an author element, then I would add the following generator:
def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row
and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
to doc_df = pd.DataFrame(list(iter_author(etree)))
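As a rough sketch of how the two generators fit together, here is a small self-contained version run against a made-up document with two authors. The <authors> wrapper element, the attribute values and the shortened keys are assumptions for illustration only, not part of the original data. It also strips the trailing newline that the CDATA sections leave in doc.text, which the code above keeps.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the <authors> root and all values are invented for this sketch.
XML = """\
<authors>
  <author type="A" language="EN" web="author-a.example">
    <documents count="2">
      <document KEY="k1" web="doc-1.example"><![CDATA[first text
]]></document>
      <document KEY="k2"><![CDATA[second text]]></document>
    </documents>
  </author>
  <author type="B" language="ES" web="author-b.example">
    <documents count="1">
      <document KEY="k3" web="doc-3.example"><![CDATA[third text]]></document>
    </documents>
  </author>
</authors>
"""


def iter_docs(author):
    """Yield one dict per <document>, merged with its author's attributes."""
    for doc in author.iter('document'):
        row = dict(author.attrib)               # author-level columns first
        row.update(doc.attrib)                  # document attributes win on a name clash
        row['data'] = (doc.text or '').strip()  # CDATA text without surrounding whitespace
        yield row


def iter_author(root):
    """Yield the rows of every <author> found at or below root."""
    for author in root.iter('author'):
        yield from iter_docs(author)


rows = list(iter_author(ET.fromstring(XML)))
```

Passing rows to pd.DataFrame, as in the answer, should then give one row per document. Because update() is applied after the author attributes are copied, a document's own web attribute replaces the author's; the second document has none, so it keeps the author's value.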
Have a look at the ElementTree tutorial provided in the xml library documentation.
Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.
import pandas as pd
import xml.etree.ElementTree as ET
xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n <code>\n 200\n </code>\n </head>\n <body>\n <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'
etree = ET.fromstring(xml_str)
dfcols = ['id', 'name']
# DataFrame.append was removed in pandas 2.0, so collect the rows first
rows = [[node.get('id'), node.get('name')] for node in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)
df.head()
You can also convert by creating a dictionary of elements and then directly converting to a data frame:
import xml.etree.ElementTree as ET
import pandas as pd
# Contents of test.xml
# <?xml version="1.0" encoding="utf-8"?> <tags> <row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" /> <row Id="2" TagName="prior" Count="598" ExcerptPostId="62158" WikiPostId="62157" /> <row Id="3" TagName="elicitation" Count="10" /> <row Id="5" TagName="open-source" Count="16" /> </tags>
root = ET.parse('test.xml').getroot()
tags = {"tags":[]}
for elem in root:
    tag = {}
    tag["Id"] = elem.attrib['Id']
    tag["TagName"] = elem.attrib['TagName']
    tag["Count"] = elem.attrib['Count']
    tags["tags"].append(tag)
df_users = pd.DataFrame(tags["tags"])
df_users.head()
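Note that two of the rows in test.xml have no ExcerptPostId or WikiPostId, so indexing elem.attrib with those names would raise a KeyError. One possible variation, shown here only as a sketch on a shortened inline copy of the sample, is to read every wanted column with Element.get, which returns None when the attribute is missing. The COLUMNS list and the inline XML string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of the sample; the second row lacks two attributes.
XML = '''<tags>
<row Id="1" TagName="bayesian" Count="4699" ExcerptPostId="20258" WikiPostId="20257" />
<row Id="3" TagName="elicitation" Count="10" />
</tags>'''

# Columns to extract (an assumption; adjust to the attributes you need).
COLUMNS = ["Id", "TagName", "Count", "ExcerptPostId", "WikiPostId"]

root = ET.fromstring(XML)

# One dict per <row>; Element.get returns None for attributes that are absent.
records = [{col: elem.get(col) for col in COLUMNS} for elem in root.iter("row")]
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as tags["tags"] above. All values are still strings at this point, so numeric columns such as Count would need converting afterwards, for example with pd.to_numeric.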
ReferenceURL : https://stackoverflow.com/questions/28259301/how-to-convert-an-xml-file-to-nice-pandas-dataframe